From fabric-operations
Diagnose failed Spark jobs, unhealthy Livy sessions, and performance bottlenecks in Microsoft Fabric via read-only CLI triage.
Install: `npx claudepluginhub microsoft/skills-for-fabric --plugin fabric-operations`
Slash command: `/fabric-operations:spark-operations-cli`
**Update Check: ONCE PER SESSION (mandatory).** The first time this skill is used in a session, run the `check-updates` skill before proceeding.
- GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
- Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
- Skip if the check was already performed earlier in this session.
CRITICAL NOTES
- To find workspace details (including the ID) from a workspace name: list all workspaces, then filter with JMESPath.
- To find item details (including the ID) from a workspace ID, item type, and item name: list all items of that type in the workspace, then filter with JMESPath.
- Skill disambiguation: `spark-operations-cli` is for read-only triage and diagnosis of existing jobs and sessions. For creating notebooks, running new jobs, or Spark development, use `spark-authoring-cli`. For interactive PySpark analysis and Livy session creation, use `spark-consumption-cli`.
This skill provides diagnostics for Microsoft Fabric Spark job failures, Livy session health, and performance bottlenecks using Fabric REST APIs and CLI tools (az rest). All diagnostic operations are read-only; session cleanup (e.g., stopping zombie sessions) requires explicit user confirmation. For Spark development and notebook authoring, use spark-authoring-cli. For interactive PySpark analysis, use spark-consumption-cli.
| Task | Reference | Notes |
|---|---|---|
| Fabric Topology & Key Concepts | COMMON-CORE.md § Fabric Topology & Key Concepts | |
| Environment URLs | COMMON-CORE.md § Environment URLs | |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication & Token Acquisition | Wrong audience = 401; read before any auth issue |
| Core Control-Plane REST APIs | COMMON-CORE.md § Core Control-Plane REST APIs | |
| Pagination | COMMON-CORE.md § Pagination | |
| Long-Running Operations (LRO) | COMMON-CORE.md § Long-Running Operations (LRO) | |
| Rate Limiting & Throttling | COMMON-CORE.md § Rate Limiting & Throttling | |
| Job Execution | COMMON-CORE.md § Job Execution | |
| Capacity Management | COMMON-CORE.md § Capacity Management | |
| Gotchas & Troubleshooting | COMMON-CORE.md § Gotchas & Troubleshooting | |
| Best Practices | COMMON-CORE.md § Best Practices | |
| Tool Selection Rationale | COMMON-CLI.md § Tool Selection Rationale | |
| Finding Workspaces and Items in Fabric | COMMON-CLI.md § Finding Workspaces and Items in Fabric | Mandatory — READ link first [needed for finding workspace id by its name or item id by its name, item type, and workspace id] |
| Authentication Recipes | COMMON-CLI.md § Authentication Recipes | az login flows and token acquisition |
| Fabric Control-Plane API via az rest | COMMON-CLI.md § Fabric Control-Plane API via az rest | Always pass --resource https://api.fabric.microsoft.com or az rest fails |
| Pagination Pattern | COMMON-CLI.md § Pagination Pattern | |
| Long-Running Operations (LRO) Pattern | COMMON-CLI.md § Long-Running Operations (LRO) Pattern | |
| Gotchas & Troubleshooting (CLI-Specific) | COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific) | az rest audience, shell escaping, token expiry |
| Quick Reference: az rest Template | COMMON-CLI.md § Quick Reference: az rest Template | |
| Quick Reference: Token Audience / CLI Tool Matrix | COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix | Which --resource + tool for each service |
| Livy Session Management | SPARK-CONSUMPTION-CORE.md § Livy Session Management | Session creation, states, lifecycle, termination |
| Interactive Data Exploration | SPARK-CONSUMPTION-CORE.md § Interactive Data Exploration | Statement execution, output retrieval, data discovery |
| Notebook Execution & Job Management | SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management | |
| Job Failure Classification | job-diagnostics.md § Failure Classification | OOM, shuffle, timeout, dependency, configuration errors |
| Reading Spark Logs via REST | job-diagnostics.md § Reading Spark Logs via REST | Driver/executor log retrieval from Livy |
| Job Instance History | job-diagnostics.md § Job Instance History | Query recent runs, compare durations, detect regressions |
| Failure Triage Workflow | job-diagnostics.md § Failure Triage Workflow | Step-by-step decision tree for diagnosing failures |
| Session Health Assessment | session-health.md § Livy Session Lifecycle | Session states, transitions, expected durations |
| Idle and Zombie Session Detection | session-health.md § Idle and Zombie Session Detection | Find and clean up leaked sessions |
| Session Resource Monitoring | session-health.md § Session Resource Monitoring | Memory and executor usage via Livy |
| Session Recovery Patterns | session-health.md § Session Recovery Patterns | Restart strategies and session replacement |
| Performance Anti-Patterns | performance-patterns.md § Anti-Patterns | Spill, shuffle, skew, small files, collect misuse |
| Stage and Task Analysis | performance-patterns.md § Stage and Task Analysis | Reading Spark UI metrics via REST |
| Optimization Recipes | performance-patterns.md § Optimization Recipes | Partition tuning, broadcast joins, caching |
| Capacity and Resource Diagnostics | performance-patterns.md § Capacity and Resource Diagnostics | CU consumption, throttling detection |
| JobInsight Event Log Copy | jobinsight-api.md § LogUtils.copyEventLog | Copy event logs from Fabric to OneLake for offline analysis |
| Local Spark History Server | spark-history-server.md § Overview | Start local SHS for full Spark UI (DAG, tasks, SQL plans) |
| Pipeline Run Diagnosis | pipeline-diagnosis.md | Diagnose all Spark activities within a pipeline run (Steps P1–P6) |
| Spark Monitoring API Overview | SPARK-MONITORING-CORE.md § Overview | GA monitoring APIs — no active session required |
| Workspace & Item Session Listing | SPARK-MONITORING-CORE.md § Workspace and Item-Level Session Listing | List Spark apps across workspace with filtering |
| Open-Source Spark History Server APIs | SPARK-MONITORING-CORE.md § Open-Source Spark History Server APIs | Jobs, stages, executors, SQL queries via REST |
| Driver and Executor Log APIs | SPARK-MONITORING-CORE.md § Driver and Executor Log APIs | Direct log retrieval without active session |
| Livy Log API | SPARK-MONITORING-CORE.md § Livy Log API | Session-level log with byte-offset pagination |
| Spark Advisor API | SPARK-MONITORING-CORE.md § Spark Advisor API | Key — automated skew detection, task errors, recommendations |
| Resource Usage API | SPARK-MONITORING-CORE.md § Resource Usage API | vCore timeline, idle/running cores, efficiency metrics |
| Monitoring Diagnostic Workflow | SPARK-MONITORING-CORE.md § Diagnostic Workflow Using Monitoring APIs | Step-by-step triage using monitoring APIs |
| Manual CLI Recipes | diagnostic-workflow.md § Manual CLI Recipes | Ad-hoc diagnostic commands for manual use |
| Key Diagnostic Patterns | diagnostic-workflow.md § Key Diagnostic Patterns | Symptom → first check → likely cause lookup |
| Diagnostic Tiers | diagnostic-workflow.md § Diagnostic Tiers | Tier 1 (online REST) vs Tier 2 (local SHS) |
| Severity Thresholds | diagnostic-workflow.md § Severity Thresholds | Metric thresholds for classifying findings |
- az rest with JMESPath filtering to extract specific fields from large API responses
- coreEfficiency metric to quantify cluster utilization before recommending scaling
- busy state

User prompt: "Why did my notebook ETL_Daily fail in workspace Production?"
Agent workflow:
- workspaceId, item → itemId (Notebook)
- TaskError: OutOfMemoryError on executor
- /stages → confirms data skew (12× max/median ratio in stage 5)

User prompt: "My Livy session abc-1234 is stuck in starting state"
Agent workflow:
User prompt: "Diagnose pipeline run 5678 in workspace Analytics"
Agent workflow:
- queryActivityRuns for run 5678
- output.result.error.{ename, evalue, traceback} from failed activity

Apply environment detection from COMMON-CLI.md to set:
- $FABRIC_API_BASE and $FABRIC_RESOURCE_SCOPE
- $FABRIC_API_URL and $LIVY_API_PATH for Livy operations

Authentication: Use token acquisition from COMMON-CLI.md § Authentication Recipes.
When the user provides a simple prompt (e.g., "Diagnose my notebook ETL_Pipeline", "What's wrong with Spark application abc-123", "Check workspace Production for issues"), follow this automated workflow. The agent collects all data and reports findings — the user does not need to know specific error patterns or API details.
| User provides | Agent resolves |
|---|---|
| Workspace name | → workspaceId (via workspace list + name filter) |
| Notebook / SJD / Lakehouse name | → itemId (via item list + name/type filter) |
| Pipeline name + run ID | → Find child Notebook/SJD activities → extract Spark sessions (see Pipeline Run Diagnosis) |
| Livy session ID | → Use directly |
| Spark application ID | → Use directly |
| Nothing specific | → Ask for at minimum workspace name + item name |
```shell
# Resolve workspace
workspaceId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces" \
  --query "value[?displayName=='<UserWorkspaceName>'].id" --output tsv)

# Resolve item (notebook, SJD, or lakehouse)
itemId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/items?type=Notebook" \
  --query "value[?displayName=='<UserItemName>'].id" --output tsv)
# If not found as Notebook, try SparkJobDefinition, then Lakehouse:
#   ?type=SparkJobDefinition  or  ?type=Lakehouse
```

```shell
# List recent Livy sessions (sorted newest first)
# Use the correct item-type path:
#   /notebooks/{itemId}/livySessions
#   /sparkJobDefinitions/{itemId}/livySessions
#   /lakehouses/{itemId}/livySessions
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions" \
  --output json
```
Item-type API paths:
| Item Type | Livy Sessions Path | Job Instances Path | Job Types |
|---|---|---|---|
| Notebook | /notebooks/{id}/livySessions | /items/{id}/jobs/instances | PipelineRunNotebook, SparkSession |
| Spark Job Definition | /sparkJobDefinitions/{id}/livySessions | /items/{id}/jobs/instances | SparkJob |
| Lakehouse | /lakehouses/{id}/livySessions | /lakehouses/{id}/jobs/instances | TableLoad, TableMaintenance |
Lakehouse note: Lakehouse Spark sessions are typically short-lived (table loads, maintenance). If `livySessions` returns empty, check `jobs/instances` for `TableLoad` / `TableMaintenance` job history. Lakehouse jobs do not have a Notebook Snapshot; use Spark Advisor and driver logs for diagnostics.
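The item-type table above can be captured in a small lookup so the rest of a script does not hard-code path segments. This is a minimal sketch: `livy_sessions_path` is a hypothetical helper name, and the three item types are the ones listed in the table (the `SparkJobDefinition` spelling follows the `?type=` filter used earlier).

```shell
# Hypothetical helper: map an item type to its Livy-sessions path segment.
# Paths are taken from the item-type table above.
livy_sessions_path() {
  # $1 = item type, $2 = item ID
  case "$1" in
    Notebook)           echo "/notebooks/$2/livySessions" ;;
    SparkJobDefinition) echo "/sparkJobDefinitions/$2/livySessions" ;;
    Lakehouse)          echo "/lakehouses/$2/livySessions" ;;
    *) echo "unsupported item type: $1" >&2; return 1 ;;
  esac
}
```

The result can be appended to `$FABRIC_API_URL/workspaces/$workspaceId` to form the listing URL; an unknown type fails fast instead of producing a malformed path.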
Present a session summary table to the user (most recent 10):
```markdown
## Recent Sessions for <notebook name>
| # | Session ID | State | Submitted | Duration | App ID |
|---|------------|-------|-----------|----------|--------|
| 1 | abc-1234… | Failed | 2h ago | 5m 23s | app_…001 |
| 2 | def-5678… | Succeeded | 4h ago | 12m 10s | app_…002 |
| 3 | ghi-9012… | Failed | 1d ago | 0s | - |
```
Session selection logic:
- state == Failed → select it automatically and proceed

Extract livyId, sparkApplicationId, and state from the selected session.
If the user provided a Livy session ID but it is not found in any session listing (workspace-level or item-level) and Spark Monitoring APIs return 404:
Why this happens: Spark Monitoring API data (jobs, stages, executor logs, driver stderr) has limited retention after session completion — typically minutes to hours. Diagnose failures as soon as possible after they occur for the richest data.
1. Determine the notebook ID — ask the user if unknown:
```markdown
I found no active data for session `<livyId>` via Spark Monitoring APIs (data retention expired).
To diagnose this session, I need the **notebook name or ID** it belongs to.
- If this was from a **pipeline run**, provide the pipeline name + run ID; `queryActivityRuns` may still have error details.
- If you know the **notebook name**, provide it and I'll construct a direct link to the Fabric UI snapshot.
```
2. Search pipeline runs (if user confirms pipeline origin or workspace has pipelines):
Iterate pipelines → GET /items/$pipelineId/jobs/instances?limit=5 → for Failed runs, queryActivityRuns to find sessionId match. Returns output.result.error.{ename, evalue, traceback[]} — richest error data available.
3. Check Job Instance API — GET /items/$notebookId/jobs/instances?limit=5 for high-level failureReason (longer retention than Spark Monitoring APIs).
4. Construct Notebook Snapshot URL for manual cell-level inspection:
https://app.powerbi.com/workloads/de-ds/sparkmonitor/{notebookId}/{livyId}?trident=1&experience=power-bi&ctid={tenantId}&tab=related
The Fabric UI retains notebook snapshots much longer than Spark Monitoring APIs (shows failed cell, traceback, cell execution times, and source code).
5. Present report with all available data:
```markdown
## Diagnostic Summary
**Session**: <livyId> | **Notebook**: <notebook name> | **State**: API data expired

### Error Details
[If queryActivityRuns returned data]:
**Exception**: <ename>: <evalue>
**Cell**: Cell In[<N>], line <M>
**Traceback**: <traceback lines>
[If only Job Instance data]:
**Failure Reason**: <failureReason from Job Instance API>

### Notebook Snapshot (cell-level details)
**Open Notebook Snapshot in Fabric UI**: `<constructed URL>`
↑ Click to view the exact failed cell, error output, and source code in the Fabric UI.

### Suggested Next Steps
1. Open the Notebook Snapshot link above to identify the exact failed cell and error
2. Fix the identified issue and re-run the notebook
3. For future failures, diagnose within 1 hour for full Spark Monitoring API data
4. For recurring failures, set up [proactive event log copy](references/jobinsight-api.md) to OneLake
```
Key principle: Exhaust all public APIs (queryActivityRuns → Job Instance → Spark Monitoring) before falling back to the manual Notebook Snapshot URL. Always present the snapshot link — it has the longest retention.
Data retention summary (public APIs):
| API | Approximate retention | Error detail level |
|---|---|---|
| Spark Monitoring (Advisor, logs, jobs, stages) | Minutes–hours | Full (stack traces, metrics) |
| queryActivityRuns (pipeline path) | ~1 hour | Full (ename, evalue, traceback, cell/line) |
| Job Instance failureReason | Days | High-level summary only |
| Notebook Snapshot URL (Fabric UI) | Days–weeks | Full cell-level (manual) |
| State | Automatic actions |
|---|---|
| Failed | Run Step 3 (failure) + Step 4 (performance) + Step 5 (resource) |
| Succeeded | Run Step 4 (performance) + Step 5 (resource) |
| InProgress | Run Step 4 (performance, partial snapshot) + Step 5 (resource) |
| Cancelled | Check Livy log for cancellation reason, then Step 3 |
| idle / busy / starting | Run Step 6 (session health) |
| dead / killed / error | Run Step 3 (failure) + Step 6 (session health) |
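The state table above amounts to a dispatch from session state to the diagnostic steps to run. A minimal sketch follows; `steps_for_state` is a hypothetical helper, and the `livy-log` token is an invented marker for the "check Livy log for cancellation reason" action.

```shell
# Hypothetical helper: print the diagnostic steps to run for a session state.
# Step numbers follow the state table above; "livy-log" is an invented marker
# for checking the Livy log before Step 3.
steps_for_state() {
  case "$1" in
    Failed)             echo "3 4 5" ;;
    Succeeded)          echo "4 5" ;;
    InProgress)         echo "4 5" ;;   # Step 4 sees only a partial snapshot
    Cancelled)          echo "livy-log 3" ;;
    idle|busy|starting) echo "6" ;;
    dead|killed|error)  echo "3 6" ;;
    *) echo "unknown state: $1" >&2; return 1 ;;
  esac
}
```

A caller can loop over the output (`for step in $(steps_for_state "$state")`) and invoke the matching step; unknown states return non-zero so they are surfaced rather than silently skipped.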
Step 3 (failure): Error API priority. Query in this order and stop when the root cause is clear:
1. Spark Advisor (`/advice`): automated root-cause with fix recommendations
2. Driver log (`/logs?type=driver&fileName=stderr&isDownload=true`): raw exception stack traces
3. Job Instance (`/jobs/instances/{id}`): high-level failureReason
4. Executor logs (`/logs?type=executor&meta=true`): per-executor OOM / ExecutorLostFailure
5. Livy log (`/logs?type=livy`): startup errors, library packaging failures
6. Resource Usage (`/resourceUsage`): capacityExceeded, task limit exhaustion

For pipeline runs, `queryActivityRuns` (Step P2 in pipeline-diagnosis.md) is the richest single source: it returns `output.result.error.{ename, evalue, traceback[]}` with cell/line numbers.
All API paths follow the pattern: $FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions/$livyId/applications/$appId/<endpoint> — see SPARK-MONITORING-CORE.md for full specs.
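The path pattern above can be assembled by a small helper so each endpoint call differs only in its last segment. This is a sketch under assumptions: `monitoring_url` is a hypothetical name, and it expects `$FABRIC_API_URL` and `$workspaceId` to be set as in the earlier resolution commands.

```shell
# Hypothetical helper: build a Spark Monitoring URL from the documented pattern.
# Expects FABRIC_API_URL and workspaceId in the environment.
monitoring_url() {
  # $1 = itemTypePath (e.g. notebooks)  $2 = itemId  $3 = livyId
  # $4 = appId                          $5 = endpoint (e.g. advice, resourceUsage)
  printf '%s/workspaces/%s/%s/%s/livySessions/%s/applications/%s/%s\n' \
    "$FABRIC_API_URL" "$workspaceId" "$1" "$2" "$3" "$4" "$5"
}

# Example use (not executed here):
#   az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
#     --url "$(monitoring_url notebooks "$itemId" "$livyId" "$appId" advice)"
```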
Auto-classify errors by matching log content against the Quick Reference Table.
Step 4 (performance): Query the /stages and /allexecutors endpoints (see SPARK-MONITORING-CORE.md § Open-Source Spark History Server APIs).
Auto-flag using Detection Thresholds: data skew (max/median task duration > 3×), disk spill (diskBytesSpilled > 0), GC pressure (jvmGcTime/executorRunTime > 20%), heavy shuffle (shuffleWriteBytes > 1 GB), small partitions (high task count, < 100ms each).
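The numeric thresholds above can be applied to per-stage metrics with a few comparisons. The sketch below uses `awk` for the floating-point ratios. `flag_stage` and its argument order are hypothetical, "1 GB" is assumed to mean 1024³ bytes, and the small-partitions check is omitted because the source does not quantify "high task count".

```shell
# Hypothetical helper: print one flag per line for a stage's metrics.
# Thresholds come from the auto-flag rules above; 1 GB is assumed to be 1024^3 bytes.
flag_stage() {
  # $1 = max task duration      $2 = median task duration (same unit)
  # $3 = diskBytesSpilled       $4 = jvmGcTime
  # $5 = executorRunTime        $6 = shuffleWriteBytes
  awk -v max="$1" -v med="$2" -v spill="$3" -v gc="$4" -v run="$5" -v shuf="$6" 'BEGIN {
    if (med > 0 && max / med > 3)   print "data-skew"
    if (spill > 0)                  print "disk-spill"
    if (run > 0 && gc / run > 0.20) print "gc-pressure"
    if (shuf > 1024 * 1024 * 1024)  print "heavy-shuffle"
  }'
}
```

The guards on `med` and `run` avoid division by zero for stages that report no completed tasks or no run time.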
Step 5 (resource): Query the /resourceUsage endpoint (see SPARK-MONITORING-CORE.md § Resource Usage API). Extract coreEfficiency, idleTime, duration.
Auto-flag: coreEfficiency < 0.3 → HIGH (underutilized); idleTime / duration > 0.4 → MEDIUM (high idle).
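As a rough illustration of the two resource rules above, the following sketch classifies a single `/resourceUsage` result. `flag_resource_usage` is a hypothetical helper; it assumes `coreEfficiency` is a fraction between 0 and 1 and that `idleTime` and `duration` share a unit.

```shell
# Hypothetical helper: apply the coreEfficiency and idle-ratio thresholds above.
flag_resource_usage() {
  # $1 = coreEfficiency (0..1)  $2 = idleTime  $3 = duration (same unit as idleTime)
  awk -v eff="$1" -v idle="$2" -v dur="$3" 'BEGIN {
    if (eff < 0.3)                   print "HIGH underutilized"
    if (dur > 0 && idle / dur > 0.4) print "MEDIUM high-idle"
  }'
}
```

Both flags can fire for the same session, in which case the output has two lines.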
Step 6 (session health): List all sessions via GET /workspaces/$workspaceId/spark/livySessions. Auto-flag: idle with no recent statements → zombie; starting beyond expected duration → capacity issue; many concurrent sessions → capacity pressure.
After running the applicable steps, present a structured report:
```markdown
## Diagnostic Summary
**Application**: <notebook name> | **Session**: <livyId> | **State**: <state>

### Findings (ordered by severity)
| # | Severity | Category | Finding | Recommended Fix |
|---|----------|----------|---------|-----------------|
| 1 | HIGH | Failure | Driver OOM from collect() on line 45 | Replace with df.write.parquet() |
| 2 | HIGH | Perf | Data skew in stage 12 (8.2× ratio) | Enable AQE skew join |
| 3 | MEDIUM | Perf | Disk spill in stage 8 (2.1 GB) | Increase shuffle partitions |
| 4 | MEDIUM | Resource | Core efficiency 22% | Reduce executor count |

### Links
- **Notebook Snapshot**: `https://app.powerbi.com/workloads/de-ds/sparkmonitor/{notebookId}/{livyId}?trident=1&experience=power-bi&ctid={tenantId}&tab=related`
- **Spark Monitor**: `https://app.powerbi.com/workloads/de-ds/sparkmonitor/{notebookId}/{livyId}?trident=1&experience=power-bi&ctid={tenantId}`

### Suggested Next Steps
1. [Most impactful fix first]
2. [Second fix]
3. [Optional: escalate to Tier 2 if needed]
```
Notebook Snapshot URL: Use the URL pattern from Step 1b. Use app.powerbi.com for production, msit.powerbi.com for MSIT.
Tier 2 escalation: If any step returns truncated data, HTTP 408/504, or the user asks for DAG/SQL plan visualization, suggest the offline workflow.
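The Notebook Snapshot URL pattern used in the reports can be filled in by a one-line helper. This is a minimal sketch: `snapshot_url` is a hypothetical name, and the optional fourth argument selects the host (`app.powerbi.com` by default, `msit.powerbi.com` for MSIT, as noted above).

```shell
# Hypothetical helper: build the Notebook Snapshot URL from the documented pattern.
snapshot_url() {
  # $1 = notebookId  $2 = livyId  $3 = tenantId  $4 = host (optional)
  printf 'https://%s/workloads/de-ds/sparkmonitor/%s/%s?trident=1&experience=power-bi&ctid=%s&tab=related\n' \
    "${4:-app.powerbi.com}" "$1" "$2" "$3"
}
```

For example, `snapshot_url "$notebookId" "$livyId" "$tenantId" msit.powerbi.com` yields the MSIT variant of the same link.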