From flagos-skills
Generates and optimizes GPU kernel operators for Python/Triton, FlagGems, and vLLM repositories. Auto-detects repo type, dispatches sub-skills for MCP-based optimization, platform specialization, and feedback.
```shell
npx claudepluginhub flagos-ai/skills --plugin flagos-skills
```
This is a **unified entry point** that bundles generation and optimization sub-skills into one:
| Sub-skill file | Purpose |
|---|---|
| **Generation** | |
| kernelgen-generate.md | Generate GPU kernels for any Python/Triton repository |
| kernelgen-generate-for-flaggems.md | Specialized generation for FlagGems repositories |
| kernelgen-generate-for-vllm.md | Specialized generation for vLLM repositories |
| **Optimization** | |
| kernelgen-optimize.md | Optimize existing Triton kernels via MCP iterative optimization (general purpose) |
| kernelgen-optimize-for-flaggems.md | Optimize Triton operators and integrate them into FlagGems (3 modes: built-in/external/experimental) |
| kernelgen-optimize-for-vllm.md | Optimize Triton operators and integrate them into vLLM (with CustomOp registration) |
| **Platform Specialization** | |
| kernelgen-specialize.md | Specialize Triton operators for target platforms (e.g., GPU → Ascend NPU) via MCP specialize_kernel |
| kernelgen-specialize-for-flaggems.md | Platform specialization + FlagGems integration (4 modes: vendor-ops/vendor-fused/override-builtin/experimental) |
| **MCP Configuration** | |
| kernelgen-mcp-setup.md | Check and auto-configure the kernelgen-server MCP service (URL is built-in; the user only provides a token) |
| **Feedback** | |
| kernelgen-submit-feedback.md | Submit bug reports and feedback via GitHub or email |
All sub-skill files are located in the same directory as this SKILL.md file.
Before anything else, ensure the kernelgen-server MCP service is configured and ready.
Use the Glob tool to find kernelgen-mcp-setup.md in this skill's directory:
Glob: **/skills/kernelgen-flagos/kernelgen-mcp-setup.md
Then use the Read tool to read the matched file and follow its instructions exactly.
Use the Glob tool to check for project identity files in the current working directory:
Glob: pyproject.toml
Glob: setup.py
Glob: setup.cfg
Then use the Read tool to read whichever file exists. Determine the project name from
the file contents (e.g., name = "flag_gems" in pyproject.toml, or name='vllm' in setup.py).
Also use the Glob tool to check for characteristic directory structures:
FlagGems indicators (match ANY):
- src/flag_gems/ directory exists
- project name is flag_gems, flag-gems, or FlagGems
- import flag_gems appears in test files

vLLM indicators (match ANY):

- vllm/ directory exists at the repo root (with vllm/__init__.py)
- project name is vllm
- csrc/ directory exists alongside vllm/

Based on the detection result, use the Read tool to read the appropriate sub-skill file from this skill's directory, then follow the instructions in that file exactly.
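The directory-structure checks can be sketched as a small classifier. The function name and the string labels are illustrative; the checks mirror the indicator lists above (the project-name and test-file indicators are omitted for brevity):

```python
from pathlib import Path

def detect_repo_type(root: str = ".") -> str:
    """Classify the repo as 'flaggems', 'vllm', or 'unknown' from its layout."""
    r = Path(root)
    # FlagGems indicator: src/flag_gems/ directory exists
    if (r / "src" / "flag_gems").is_dir():
        return "flaggems"
    # vLLM indicators: vllm/ package at the repo root, or csrc/ alongside vllm/
    if (r / "vllm" / "__init__.py").is_file():
        return "vllm"
    if (r / "csrc").is_dir() and (r / "vllm").is_dir():
        return "vllm"
    return "unknown"
```

"unknown" is a valid outcome here: the dispatch tables below treat it as a signal to fall back to the general-purpose sub-skill, not as an error.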
To locate the sub-skill files: They are in the same directory as this SKILL.md. Use the Glob tool to find the path:
Glob: **/skills/kernelgen-flagos/kernelgen-generate.md
Then use the Read tool to read the matched path.
Generation requests (user wants to create/generate a new operator):
| Detection Result | Action |
|---|---|
| FlagGems repository detected | Read kernelgen-generate-for-flaggems.md and follow it |
| vLLM repository detected | Read kernelgen-generate-for-vllm.md and follow it |
| Neither detected (or unknown) | Read kernelgen-generate.md and follow it |
Optimization requests (user wants to optimize an existing operator, mentions "optimize", "speedup", "improve performance"):
| Detection Result | Action |
|---|---|
| FlagGems repository detected | Read kernelgen-optimize-for-flaggems.md and follow it |
| vLLM repository detected | Read kernelgen-optimize-for-vllm.md and follow it |
| Neither detected (or unknown) | Read kernelgen-optimize.md and follow it |
Specialization requests (user wants to migrate/specialize an operator to a different platform, mentions "specialize", "migrate to Ascend/NPU", "platform migration"):
| Detection Result | Action |
|---|---|
| FlagGems repository detected | Read kernelgen-specialize-for-flaggems.md and follow it |
| Neither detected (or unknown) | Read kernelgen-specialize.md and follow it |
Feedback requests:
| Detection Result | Action |
|---|---|
| User reports a bug or requests feedback submission | Read kernelgen-submit-feedback.md and follow it |
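The dispatch tables above amount to a lookup from (request kind, detection result) to a sub-skill file. The file names are the real ones from this skill; the dictionary and function are purely illustrative:

```python
# Map (request kind, detected repo type) -> sub-skill file to read.
DISPATCH = {
    ("generate", "flaggems"): "kernelgen-generate-for-flaggems.md",
    ("generate", "vllm"): "kernelgen-generate-for-vllm.md",
    ("generate", "unknown"): "kernelgen-generate.md",
    ("optimize", "flaggems"): "kernelgen-optimize-for-flaggems.md",
    ("optimize", "vllm"): "kernelgen-optimize-for-vllm.md",
    ("optimize", "unknown"): "kernelgen-optimize.md",
    ("specialize", "flaggems"): "kernelgen-specialize-for-flaggems.md",
    ("specialize", "unknown"): "kernelgen-specialize.md",
}

def pick_subskill(kind: str, repo_type: str) -> str:
    """Fall back to the general-purpose file when no specialized one exists."""
    if kind == "feedback":
        return "kernelgen-submit-feedback.md"
    return DISPATCH.get((kind, repo_type), DISPATCH[(kind, "unknown")])
```

Note that specialization has no vLLM-specific entry, so a vLLM repo falls through to the general kernelgen-specialize.md, matching the table above.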
Important rules:

- Generation MUST use the mcp__kernelgen-mcp__generate_kernel MCP tool. Optimization uses mcp__kernelgen-mcp__optimize_kernel, and platform specialization uses mcp__kernelgen-mcp__specialize_kernel.
- NEVER generate Triton kernels, PyTorch wrappers, or operator implementations yourself. If MCP is not configured, not reachable, or fails after all retries, STOP and report the issue; do NOT fall back to writing code manually.
- At any point during the workflow, if the user reports a bug, says something is broken, or asks to submit feedback about the skill, read kernelgen-submit-feedback.md from this skill's directory and follow it.

```shell
# === Generation ===
# Generate a kernel operator (auto-detects repo type)
/kernelgen-flagos relu

# Generate with explicit function type
/kernelgen-flagos rms_norm --func-type normalization

# === Optimization ===
# Optimize an existing Triton kernel (auto-detects repo type)
# Just say "optimize the relu kernel" or "improve kernel performance"
# The skill will automatically dispatch to the right sub-skill:
# - FlagGems repo detected → FlagGems-specific workflow
# - vLLM repo detected → vLLM-specific workflow
# - Otherwise → general-purpose workflow
```
If you encounter any issues during generation, just say "submit feedback" or "report a bug" and the skill will guide you through the feedback submission process.