Diagnoses failed Spark jobs, unhealthy Livy sessions, and performance bottlenecks like OOM, shuffle spill, and data skew in Microsoft Fabric via read-only CLI triage.
```bash
npx claudepluginhub microsoft/skills-for-fabric --plugin skills-for-fabric
```
This skill uses the workspace's default tool permissions.
Similar skills:
- Analyzes Fabric lakehouse data interactively via Livy API sessions using PySpark and Spark SQL for advanced analytics, DataFrames, cross-joins, Delta time-travel, and JSON data.
- Troubleshoots Spark applications on AWS EMR, Glue, and SageMaker: analyzes PySpark/Scala job failures, identifies bottlenecks, provides code recommendations and optimizations.
- Diagnoses and fixes common Databricks errors like cluster not ready, Spark OOM, Delta concurrent writes, using CLI and SDK commands.
**Update Check — ONCE PER SESSION (mandatory)**: The first time this skill is used in a session, run the check-updates skill before proceeding.
- GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
- Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs. remote `package.json` versions (sketch below).
- Skip if the check was already performed earlier in this session.
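For the version-comparison path, a minimal sketch (the local package path and repository URL are assumptions; adjust to the actual install location):

```bash
# Compare the installed skill version against the published one.
# Both the local path and the raw GitHub URL below are assumptions.
local_ver=$(jq -r '.version' node_modules/skills-for-fabric/package.json)
remote_ver=$(curl -fsSL https://raw.githubusercontent.com/microsoft/skills-for-fabric/main/package.json | jq -r '.version')
if [ "$local_ver" != "$remote_ver" ]; then
  echo "Update available: $local_ver -> $remote_ver"
fi
```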
CRITICAL NOTES
- To find the workspace details (including its ID) from a workspace name: list all workspaces, then use JMESPath filtering.
- To find the item details (including its ID) from a workspace ID, item type, and item name: list all items of that type in that workspace, then use JMESPath filtering.
- Skill disambiguation: `spark-operations-cli` is for read-only triage and diagnosis of existing jobs and sessions. For creating notebooks, running new jobs, or Spark development, use `spark-authoring-cli`. For interactive PySpark analysis and Livy session creation, use `spark-consumption-cli`.
This skill provides diagnostics for Microsoft Fabric Spark job failures, Livy session health, and performance bottlenecks using Fabric REST APIs and CLI tools (az rest). All diagnostic operations are read-only; session cleanup (e.g., stopping zombie sessions) requires explicit user confirmation. For Spark development and notebook authoring, use spark-authoring-cli. For interactive PySpark analysis, use spark-consumption-cli.
| Task | Reference | Notes |
|---|---|---|
| Fabric Topology & Key Concepts | COMMON-CORE.md § Fabric Topology & Key Concepts | |
| Environment URLs | COMMON-CORE.md § Environment URLs | |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication & Token Acquisition | Wrong audience = 401; read before any auth issue |
| Core Control-Plane REST APIs | COMMON-CORE.md § Core Control-Plane REST APIs | |
| Pagination | COMMON-CORE.md § Pagination | |
| Long-Running Operations (LRO) | COMMON-CORE.md § Long-Running Operations (LRO) | |
| Rate Limiting & Throttling | COMMON-CORE.md § Rate Limiting & Throttling | |
| Job Execution | COMMON-CORE.md § Job Execution | |
| Capacity Management | COMMON-CORE.md § Capacity Management | |
| Gotchas & Troubleshooting | COMMON-CORE.md § Gotchas & Troubleshooting | |
| Best Practices | COMMON-CORE.md § Best Practices | |
| Tool Selection Rationale | COMMON-CLI.md § Tool Selection Rationale | |
| Finding Workspaces and Items in Fabric | COMMON-CLI.md § Finding Workspaces and Items in Fabric | Mandatory — read first; needed to resolve a workspace ID from its name, or an item ID from its name, type, and workspace ID |
| Authentication Recipes | COMMON-CLI.md § Authentication Recipes | az login flows and token acquisition |
| Fabric Control-Plane API via az rest | COMMON-CLI.md § Fabric Control-Plane API via az rest | Always pass `--resource https://api.fabric.microsoft.com` or az rest fails |
| Pagination Pattern | COMMON-CLI.md § Pagination Pattern | |
| Long-Running Operations (LRO) Pattern | COMMON-CLI.md § Long-Running Operations (LRO) Pattern | |
| Gotchas & Troubleshooting (CLI-Specific) | COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific) | az rest audience, shell escaping, token expiry |
| Quick Reference: az rest Template | COMMON-CLI.md § Quick Reference: az rest Template | |
| Quick Reference: Token Audience ↔ CLI Tool Matrix | COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix | Which `--resource` + tool for each service |
| Livy Session Management | SPARK-CONSUMPTION-CORE.md § Livy Session Management | Session creation, states, lifecycle, termination |
| Interactive Data Exploration | SPARK-CONSUMPTION-CORE.md § Interactive Data Exploration | Statement execution, output retrieval, data discovery |
| Notebook Execution & Job Management | SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management | |
| Job Failure Classification | job-diagnostics.md § Failure Classification | OOM, shuffle, timeout, dependency, configuration errors |
| Reading Spark Logs via REST | job-diagnostics.md § Reading Spark Logs via REST | Driver/executor log retrieval from Livy |
| Job Instance History | job-diagnostics.md § Job Instance History | Query recent runs, compare durations, detect regressions |
| Failure Triage Workflow | job-diagnostics.md § Failure Triage Workflow | Step-by-step decision tree for diagnosing failures |
| Session Health Assessment | session-health.md § Livy Session Lifecycle | Session states, transitions, expected durations |
| Idle and Zombie Session Detection | session-health.md § Idle and Zombie Session Detection | Find and clean up leaked sessions |
| Session Resource Monitoring | session-health.md § Session Resource Monitoring | Memory and executor usage via Livy |
| Session Recovery Patterns | session-health.md § Session Recovery Patterns | Restart strategies and session replacement |
| Performance Anti-Patterns | performance-patterns.md § Anti-Patterns | Spill, shuffle, skew, small files, collect misuse |
| Stage and Task Analysis | performance-patterns.md § Stage and Task Analysis | Reading Spark UI metrics via REST |
| Optimization Recipes | performance-patterns.md § Optimization Recipes | Partition tuning, broadcast joins, caching |
| Capacity and Resource Diagnostics | performance-patterns.md § Capacity and Resource Diagnostics | CU consumption, throttling detection |
| JobInsight Event Log Copy | jobinsight-api.md § LogUtils.copyEventLog | Copy event logs from Fabric to OneLake for offline analysis |
| Local Spark History Server | spark-history-server.md § Overview | Start local SHS for full Spark UI (DAG, tasks, SQL plans) |
| Pipeline Run Diagnosis | pipeline-diagnosis.md | Diagnose all Spark activities within a pipeline run (Steps P1–P6) |
| Spark Monitoring API Overview | SPARK-MONITORING-CORE.md § Overview | GA monitoring APIs — no active session required |
| Workspace & Item Session Listing | SPARK-MONITORING-CORE.md § Workspace and Item-Level Session Listing | List Spark apps across workspace with filtering |
| Open-Source Spark History Server APIs | SPARK-MONITORING-CORE.md § Open-Source Spark History Server APIs | Jobs, stages, executors, SQL queries via REST |
| Driver and Executor Log APIs | SPARK-MONITORING-CORE.md § Driver and Executor Log APIs | Direct log retrieval without active session |
| Livy Log API | SPARK-MONITORING-CORE.md § Livy Log API | Session-level log with byte-offset pagination |
| Spark Advisor API | SPARK-MONITORING-CORE.md § Spark Advisor API | Key — automated skew detection, task errors, recommendations |
| Resource Usage API | SPARK-MONITORING-CORE.md § Resource Usage API | vCore timeline, idle/running cores, efficiency metrics |
| Monitoring Diagnostic Workflow | SPARK-MONITORING-CORE.md § Diagnostic Workflow Using Monitoring APIs | Step-by-step triage using monitoring APIs |
| Manual CLI Recipes | diagnostic-workflow.md § Manual CLI Recipes | Ad-hoc diagnostic commands for manual use |
| Key Diagnostic Patterns | diagnostic-workflow.md § Key Diagnostic Patterns | Symptom → first check → likely cause lookup |
| Diagnostic Tiers | diagnostic-workflow.md § Diagnostic Tiers | Tier 1 (online REST) vs Tier 2 (local SHS) |
| Severity Thresholds | diagnostic-workflow.md § Severity Thresholds | Metric thresholds for classifying findings |
Key techniques:
- `az rest` with JMESPath filtering to extract specific fields from large API responses
- `coreEfficiency` metric to quantify cluster utilization before recommending scaling
- Livy session states (e.g., `busy`) to assess session health

User prompt: "Why did my notebook ETL_Daily fail in workspace Production?"
Agent workflow:
1. Resolve workspace → workspaceId, item → itemId (Notebook)
2. `TaskError: OutOfMemoryError` on executor (from Spark Advisor / driver logs)
3. `/stages` → confirms data skew (12× max/median ratio in stage 5)

User prompt: "My Livy session abc-1234 is stuck in starting state"
Agent workflow:
User prompt: "Diagnose pipeline run 5678 in workspace Analytics"
Agent workflow:
1. `queryActivityRuns` for run 5678
2. Extract `output.result.error.{ename, evalue, traceback}` from the failed activity

Apply environment detection from COMMON-CLI.md to set:
- `$FABRIC_API_BASE` and `$FABRIC_RESOURCE_SCOPE`
- `$FABRIC_API_URL` and `$LIVY_API_PATH` for Livy operations

Authentication: Use token acquisition from COMMON-CLI.md § Authentication Recipes.
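As a sketch, plausible values for the public cloud (COMMON-CLI.md holds the authoritative detection logic; sovereign and MSIT clouds differ):

```bash
# Public-cloud defaults; these values are assumptions, see COMMON-CLI.md.
export FABRIC_RESOURCE_SCOPE="https://api.fabric.microsoft.com"
export FABRIC_API_BASE="https://api.fabric.microsoft.com"
export FABRIC_API_URL="$FABRIC_API_BASE/v1"

# Token acquisition per COMMON-CLI.md § Authentication Recipes.
az login --scope "$FABRIC_RESOURCE_SCOPE/.default"
```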
When the user provides a simple prompt (e.g., "Diagnose my notebook ETL_Pipeline", "What's wrong with Spark application abc-123", "Check workspace Production for issues"), follow this automated workflow. The agent collects all data and reports findings — the user does not need to know specific error patterns or API details.
| User provides | Agent resolves |
|---|---|
| Workspace name | → workspaceId (via workspace list + name filter) |
| Notebook / SJD / Lakehouse name | → itemId (via item list + name/type filter) |
| Pipeline name + run ID | → Find child Notebook/SJD activities → extract Spark sessions (see Pipeline Run Diagnosis) |
| Livy session ID | → Use directly |
| Spark application ID | → Use directly |
| Nothing specific | → Ask for at minimum workspace name + item name |
```bash
# Resolve workspace
workspaceId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces" \
  --query "value[?displayName=='<UserWorkspaceName>'].id" --output tsv)

# Resolve item (notebook, SJD, or lakehouse)
itemId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/items?type=Notebook" \
  --query "value[?displayName=='<UserItemName>'].id" --output tsv)

# If not found as Notebook, try SparkJobDefinition, then Lakehouse:
# ?type=SparkJobDefinition or ?type=Lakehouse
```
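A sketch of that fallback as a loop, using the same placeholder for the item name:

```bash
# Try each supported item type until the display name resolves to an ID.
for itemType in Notebook SparkJobDefinition Lakehouse; do
  itemId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
    --url "$FABRIC_API_URL/workspaces/$workspaceId/items?type=$itemType" \
    --query "value[?displayName=='<UserItemName>'].id" --output tsv)
  [ -n "$itemId" ] && break
done
echo "Resolved $itemType item: $itemId"
```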
```bash
# List recent Livy sessions (sorted newest first)
# Use the correct item-type path:
#   /notebooks/{itemId}/livySessions
#   /sparkJobDefinitions/{itemId}/livySessions
#   /lakehouses/{itemId}/livySessions
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions" \
  --output json
```
Item-type API paths:
| Item Type | Livy Sessions Path | Job Instances Path | Job Types |
|---|---|---|---|
| Notebook | /notebooks/{id}/livySessions | /items/{id}/jobs/instances | PipelineRunNotebook, SparkSession |
| Spark Job Definition | /sparkJobDefinitions/{id}/livySessions | /items/{id}/jobs/instances | SparkJob |
| Lakehouse | /lakehouses/{id}/livySessions | /lakehouses/{id}/jobs/instances | TableLoad, TableMaintenance |
> **Lakehouse note**: Lakehouse Spark sessions are typically short-lived (table loads, maintenance). If `livySessions` returns empty, check `jobs/instances` for `TableLoad`/`TableMaintenance` job history. Lakehouse jobs do not have a Notebook Snapshot — use Spark Advisor and driver logs for diagnostics.
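A sketch of that lakehouse fallback (the `jobType` field name is an assumption; verify against the job instance response):

```bash
# Lakehouse fallback: inspect TableLoad / TableMaintenance job history.
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$itemId/jobs/instances" \
  --query "value[?jobType=='TableLoad' || jobType=='TableMaintenance']" \
  --output json
```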
Present a session summary table to the user (most recent 10):
```
## Recent Sessions for <notebook name>

| # | Session ID | State | Submitted | Duration | App ID |
|---|------------|-------|-----------|----------|--------|
| 1 | abc-1234… | Failed | 2h ago | 5m 23s | app_…001 |
| 2 | def-5678… | Succeeded | 4h ago | 12m 10s | app_…002 |
| 3 | ghi-9012… | Failed | 1d ago | 0s | — |
```
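A sketch for pulling the ten most recent sessions into that shape (response field names such as `submittedDateTime` and `livyId` are assumptions; verify against the listing schema):

```bash
# Ten most recent sessions, newest first.
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions" \
  --query "reverse(sort_by(value, &submittedDateTime))[:10].{id:livyId, state:state, submitted:submittedDateTime, app:sparkApplicationId}" \
  --output table
```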
Session selection logic:
- `state == Failed` → select it automatically and proceed

Extract `livyId`, `sparkApplicationId`, and `state` from the selected session.
If the user provided a Livy session ID but it is not found in any session listing (workspace-level or item-level) and Spark Monitoring APIs return 404:
Why this happens: Spark Monitoring API data (jobs, stages, executor logs, driver stderr) has limited retention after session completion — typically minutes to hours. Diagnose failures as soon as possible after they occur for the richest data.
1. Determine the notebook ID — ask the user if unknown:
   ```
   I found no active data for session `<livyId>` via Spark Monitoring APIs (data retention expired).
   To diagnose this session, I need the **notebook name or ID** it belongs to.
   - If this was from a **pipeline run**, provide the pipeline name + run ID — `queryActivityRuns` may still have error details.
   - If you know the **notebook name**, provide it and I'll construct a direct link to the Fabric UI snapshot.
   ```
2. Search pipeline runs (if user confirms pipeline origin or workspace has pipelines):
   Iterate pipelines → `GET /items/$pipelineId/jobs/instances?limit=5` → for Failed runs, `queryActivityRuns` to find a sessionId match. Returns `output.result.error.{ename, evalue, traceback[]}` — the richest error data available (see the sketch after the retention table below).
3. Check the Job Instance API — `GET /items/$notebookId/jobs/instances?limit=5` for the high-level `failureReason` (longer retention than Spark Monitoring APIs).
4. Construct Notebook Snapshot URL for manual cell-level inspection:
https://app.powerbi.com/workloads/de-ds/sparkmonitor/{notebookId}/{livyId}?trident=1&experience=power-bi&ctid={tenantId}&tab=related
The Fabric UI retains notebook snapshots much longer than Spark Monitoring APIs (shows failed cell, traceback, cell execution times, and source code).
5. Present report with all available data:
   ```
   ## Diagnostic Summary
   **Session**: <livyId> | **Notebook**: <notebook name> | **State**: API data expired

   ### Error Details
   [If queryActivityRuns returned data]:
   **Exception**: <ename>: <evalue>
   **Cell**: Cell In[<N>], line <M>
   **Traceback**: <traceback lines>

   [If only Job Instance data]:
   **Failure Reason**: <failureReason from Job Instance API>

   ### Notebook Snapshot (cell-level details)
   **Open Notebook Snapshot in Fabric UI**: `<constructed URL>`
   ↑ Click to view the exact failed cell, error output, and source code in the Fabric UI.

   ### Suggested Next Steps
   1. Open the Notebook Snapshot link above to identify the exact failed cell and error
   2. Fix the identified issue and re-run the notebook
   3. For future failures, diagnose within 1 hour for full Spark Monitoring API data
   4. For recurring failures, set up [proactive event log copy](references/jobinsight-api.md) to OneLake
   ```
Key principle: Exhaust all public APIs (queryActivityRuns → Job Instance → Spark Monitoring) before falling back to the manual Notebook Snapshot URL. Always present the snapshot link — it has the longest retention.
Data retention summary (public APIs):
| API | Approximate retention | Error detail level |
|---|---|---|
| Spark Monitoring (Advisor, logs, jobs, stages) | Minutes–hours | Full (stack traces, metrics) |
| `queryActivityRuns` (pipeline path) | ~1 hour | Full (ename, evalue, traceback, cell/line) |
| Job Instance `failureReason` | Days | High-level summary only |
| Notebook Snapshot URL (Fabric UI) | Days–weeks | Full cell-level (manual) |
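A sketch of the pipeline-run search from step 2 above. The `queryactivityruns` path and empty request body are assumptions modeled on the Fabric Data Factory API; confirm the exact shape in pipeline-diagnosis.md (Step P2):

```bash
# Pull error details for failed activities in one pipeline run.
# Endpoint and body shape are assumptions, see pipeline-diagnosis.md.
az rest --method post --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/datapipelines/pipelineruns/$runId/queryactivityruns" \
  --body '{}' \
  --query "value[?status=='Failed'].{activity:activityName, error:output.result.error}"
```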
| State | Automatic actions |
|---|---|
| `Failed` | Run Step 3 (failure) + Step 4 (performance) + Step 5 (resource) |
| `Succeeded` | Run Step 4 (performance) + Step 5 (resource) |
| `InProgress` | Run Step 4 (performance — partial snapshot) + Step 5 (resource) |
| `Cancelled` | Check Livy log for cancellation reason, then Step 3 |
| `idle` / `busy` / `starting` | Run Step 6 (session health) |
| `dead` / `killed` / `error` | Run Step 3 (failure) + Step 6 (session health) |
Error API priority — query in this order, stop when root cause is clear:
1. Spark Advisor (`/advice`) — automated root-cause with fix recommendations
2. Driver stderr (`/logs?type=driver&fileName=stderr&isDownload=true`) — raw exception stack traces
3. Job Instance (`/jobs/instances/{id}`) — high-level `failureReason`
4. Executor logs (`/logs?type=executor&meta=true`) — per-executor OOM / ExecutorLostFailure
5. Livy log (`/logs?type=livy`) — startup errors, library packaging failures
6. Resource usage (`/resourceUsage`) — `capacityExceeded`, task limit exhaustion

> For pipeline runs, `queryActivityRuns` (Step P2 in pipeline-diagnosis.md) is the richest single source — returns `output.result.error.{ename, evalue, traceback[]}` with cell/line numbers.
All API paths follow the pattern: $FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions/$livyId/applications/$appId/<endpoint> — see SPARK-MONITORING-CORE.md for full specs.
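For example, a sketch instantiating that pattern for priority 2, driver stderr (itemTypePath shown for a notebook):

```bash
# Download driver stderr and scan for the first exceptions.
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/notebooks/$itemId/livySessions/$livyId/applications/$appId/logs?type=driver&fileName=stderr&isDownload=true" \
  --output-file driver-stderr.log
grep -iE "exception|error|outofmemory" driver-stderr.log | head -20
```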
Auto-classify errors by matching log content against the Quick Reference Table.
Query /stages and /allexecutors endpoints (see SPARK-MONITORING-CORE.md § Open-Source Spark History Server APIs).
Auto-flag using Detection Thresholds: data skew (max/median task duration > 3×), disk spill (diskBytesSpilled > 0), GC pressure (jvmGcTime/executorRunTime > 20%), heavy shuffle (shuffleWriteBytes > 1 GB), small partitions (high task count, < 100ms each).
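As a sketch of the skew check using the open-source `taskSummary` quantile endpoint (the path segment and the `duration` field are assumptions; verify against SPARK-MONITORING-CORE.md):

```bash
# Median and max task duration for one stage attempt; max/median > 3 suggests skew.
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions/$livyId/applications/$appId/stages/<stageId>/0/taskSummary?quantiles=0.5,1.0" \
  --query "{medianMs: duration[0], maxMs: duration[1]}"
```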
Query /resourceUsage endpoint (see SPARK-MONITORING-CORE.md § Resource Usage API). Extract coreEfficiency, idleTime, duration.
Auto-flag: coreEfficiency < 0.3 → HIGH (underutilized); idleTime / duration > 0.4 → MEDIUM (high idle).
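A sketch of those two checks (the response field names mirror the metrics named above and are assumptions):

```bash
# Flag underutilization and high idle time from /resourceUsage.
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions/$livyId/applications/$appId/resourceUsage" \
| jq -r 'if .coreEfficiency < 0.3 then "HIGH: underutilized (coreEfficiency=\(.coreEfficiency))"
         elif (.idleTime / .duration) > 0.4 then "MEDIUM: high idle time"
         else "OK: within thresholds" end'
```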
List all sessions via GET /workspaces/$workspaceId/spark/livySessions. Auto-flag: idle with no recent statements → zombie; starting beyond expected duration → capacity issue; many concurrent sessions → capacity pressure.
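A sketch of the workspace-wide sweep (the listing is read-only; stopping any session it surfaces still requires explicit user confirmation, and the field names are assumptions):

```bash
# Candidate zombie sessions: idle state across the whole workspace.
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/spark/livySessions" \
  --query "value[?state=='idle'].{id:livyId, item:itemName, submitted:submittedDateTime}" \
  --output table
```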
After running the applicable steps, present a structured report:
```
## Diagnostic Summary
**Application**: <notebook name> | **Session**: <livyId> | **State**: <state>

### Findings (ordered by severity)
| # | Severity | Category | Finding | Recommended Fix |
|---|----------|----------|---------|-----------------|
| 1 | HIGH | Failure | Driver OOM from collect() on line 45 | Replace with df.write.parquet() |
| 2 | HIGH | Perf | Data skew in stage 12 (8.2× ratio) | Enable AQE skew join |
| 3 | MEDIUM | Perf | Disk spill in stage 8 (2.1 GB) | Increase shuffle partitions |
| 4 | MEDIUM | Resource | Core efficiency 22% | Reduce executor count |

### Links
- **Notebook Snapshot**: `https://app.powerbi.com/workloads/de-ds/sparkmonitor/{notebookId}/{livyId}?trident=1&experience=power-bi&ctid={tenantId}&tab=related`
- **Spark Monitor**: `https://app.powerbi.com/workloads/de-ds/sparkmonitor/{notebookId}/{livyId}?trident=1&experience=power-bi&ctid={tenantId}`

### Suggested Next Steps
1. [Most impactful fix first]
2. [Second fix]
3. [Optional: escalate to Tier 2 if needed]
```
Notebook Snapshot URL: Use the URL pattern from Step 1b. Use app.powerbi.com for production, msit.powerbi.com for MSIT.
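A sketch of assembling that link, with the tenant ID pulled from the current az session:

```bash
# Build the Notebook Snapshot URL; swap in msit.powerbi.com for MSIT.
tenantId=$(az account show --query tenantId --output tsv)
echo "https://app.powerbi.com/workloads/de-ds/sparkmonitor/$notebookId/$livyId?trident=1&experience=power-bi&ctid=$tenantId&tab=related"
```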
Tier 2 escalation: If any step returns truncated data, HTTP 408/504, or the user asks for DAG/SQL plan visualization, suggest the offline workflow.