fabric-consumption
Implements end-to-end Medallion Architecture (Bronze/Silver/Gold) lakehouse patterns in Microsoft Fabric using PySpark, Delta Lake, and Fabric Pipelines. For multi-layer workspace setup, ingestion-to-analytics pipelines, data quality enforcement, and Spark optimization per layer.
Install: `npx claudepluginhub microsoft/skills-for-fabric --plugin fabric-consumption`

This skill uses the workspace's default tool permissions.
Related skills:
- Builds Databricks Delta Lake ETL pipelines using medallion architecture, Auto Loader for incremental ingestion, and MERGE INTO for Silver layer upserts.
- Creates, configures, and updates Databricks Lakeflow Spark Declarative Pipelines (SDP/LDP) using serverless compute. Handles data ingestion via streaming tables, materialized views, CDC, SCD Type 2, and Auto Loader.
- Explains Microsoft Fabric features like Direct Lake mode, OneLake delta tables, semantic models, lakehouse Power BI integration, and comparisons to Import/DirectQuery.

> **Update Check — ONCE PER SESSION (mandatory)**
> The first time this skill is used in a session, run the check-updates skill before proceeding.
- GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
- Claude Code / Cowork / Cursor / Windsurf / Codex: compare the local vs remote `package.json` version.
- Skip if the check was already performed earlier in this session.
CRITICAL NOTES
- To find workspace details (including its ID) from a workspace name: list all workspaces, then filter with JMESPath
- To find item details (including its ID) from a workspace ID, item type, and item name: list all items of that type in the workspace, then filter with JMESPath (see the sketch below)
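A minimal sketch of that lookup pattern, assuming the standard list endpoints and reusing the `sales-bronze-dev` / `sales_bronze` names from the examples later in this document:

```bash
# Resolve a workspace ID from its display name (JMESPath filter on the list response)
workspace_id=$(az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --query "value[?displayName=='sales-bronze-dev'].id | [0]" --output tsv)

# Resolve a lakehouse ID from its display name within that workspace
lakehouse_id=$(az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/lakehouses" \
  --query "value[?displayName=='sales_bronze'].id | [0]" --output tsv)
```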
Read these companion documents — they contain the foundational context this skill depends on:
- COMMON-CLI.md: `az rest`, `az login`, token acquisition, Fabric REST via CLI
- notebook-api-operations.md: `.ipynb` structure requirements, cell format, getDefinition/updateDefinition workflow
- data-engineering-patterns.md: Spark-specific optimization details per layer
Medallion Architecture is a data lakehouse pattern with three progressive layers:
| Layer | Purpose | Optimization Profile | Use Case |
|---|---|---|---|
| Bronze (Raw) | Land raw data exactly as received | Write-optimized, append-only, partitioned by ingestion date | Audit trail, reprocessing, lineage |
| Silver (Cleaned) | Deduplicated, validated, conformed data | Balanced read/write, partitioned by business date | Feature engineering, operational reporting |
| Gold (Aggregated) | Pre-calculated metrics for analytics | Read-optimized (ZORDER, compaction), partitioned by month/year | Power BI reports, dashboards, ad-hoc analytics via SQL endpoint |
- Use `mergeSchema` when sources change
- Follow the `.ipynb` validation + Fabric nuances in notebook-api-operations.md when creating notebooks via the REST API — every code cell must include `"outputs": []` and `"execution_count": null`
- Land raw files in `Files/` first (via curl, OneLake API, or Fabric pipeline Copy activity), then read from the lakehouse path
- Validate the `.ipynb` structure — missing `execution_count: null` or `outputs: []` on code cells causes silent failures or "Job instance failed without detail error"

When setting up a medallion workspace, guide LLM to generate commands for:
- Workspaces: `{project}-bronze-{env}`, `{project}-silver-{env}`, `{project}-gold-{env}`
- Lakehouses: a `{project}_bronze` lakehouse, a `{project}_silver` lakehouse, and a `{project}_gold` lakehouse
- Notebooks that pass `.ipynb` validation + Fabric nuances, with `metadata.dependencies.lakehouse` set to the correct lakehouse ID (see notebook-api-operations.md § Default Lakehouse Binding; a sketch of the binding shape follows this list)
- Trust that `updateDefinition` returned Succeeded; this is sufficient confirmation that content and lakehouse binding persisted. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds unnecessary latency.
- Execution via `POST .../jobs/instances?jobType=RunNotebook` with the correct `defaultLakehouse` in the execution config (both `id` and `name` required)
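A minimal sketch of that binding block inside the `.ipynb` payload (key names match the lifecycle steps later in this document; the GUID values are placeholders):

```json
{
  "metadata": {
    "dependencies": {
      "lakehouse": {
        "default_lakehouse": "<lakehouse-guid>",
        "default_lakehouse_name": "{project}_bronze",
        "default_lakehouse_workspace_id": "<workspace-guid>"
      }
    }
  }
}
```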
If the user explicitly asks for a single-workspace deployment (for example, a POC, small team, or monolithic pattern), keep the current approach:

- Parameterize by environment: workspace name suffix (`-dev`, `-prod`), data volume (sample vs full), capacity SKU, and Bronze retention period.
When a user requests data ingestion into the Bronze layer, guide LLM to:
Data must land in the lakehouse `Files/` folder before Spark can read it — use one of:

- `curl` — upload files via the REST API using a storage.azure.com token (see COMMON-CLI.md § OneLake Data Access; see the sketch below)
- `notebookutils.fs` — copy from mounted storage paths within a notebook
- A Fabric pipeline Copy activity

Reading directly from an external URL (e.g., `spark.read.format("csv").load("https://...")`) will fail. Once files are in `Files/`, read them using lakehouse-relative paths (e.g., `spark.read.format("csv").load("Files/landing/daily/")`).
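A minimal upload sketch, assuming the OneLake DFS endpoint (ADLS Gen2-compatible) and a hypothetical local file `daily.csv`; see COMMON-CLI.md § OneLake Data Access for the canonical pattern:

```bash
# Acquire a storage-audience token for OneLake
token=$(az account get-access-token --resource "https://storage.azure.com" \
  --query accessToken --output tsv)

base="https://onelake.dfs.fabric.microsoft.com/<workspace-name>/<lakehouse-name>.Lakehouse/Files/landing/daily/daily.csv"

# ADLS Gen2-style upload: create the file, append the content, then flush
curl -s -X PUT -H "Authorization: Bearer $token" -H "x-ms-version: 2021-06-08" \
  -d "" "$base?resource=file"
curl -s -X PATCH -H "Authorization: Bearer $token" -H "x-ms-version: 2021-06-08" \
  -H "Content-Type: application/octet-stream" --data-binary @daily.csv \
  "$base?action=append&position=0"
curl -s -X PATCH -H "Authorization: Bearer $token" -H "x-ms-version: 2021-06-08" \
  -d "" "$base?action=flush&position=$(wc -c < daily.csv)"
```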
- Use the `mergeSchema` option when source schemas change; coordinate downstream updates to Gold tables and Power BI datasets

When a user requests Gold analytics tables, guide LLM to generate:
spark.conf.set("spark.sql.parquet.vorder.default", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
- V-Order (`vorder.default`) — applies Fabric's columnar sort optimization to all Parquet files, dramatically improving Direct Lake and SQL endpoint read performance
- Optimize Write (`optimizeWrite.enabled`) — coalesces small partitions into optimally sized files (target ~1 GB per `binSize`), reducing file count and improving scan efficiency
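A minimal Gold-layer sketch, assuming hypothetical Silver/Gold table names (`silver.trips_clean`, `gold.nyc_taxi_daily_summary`) and illustrative metric columns:

```python
from pyspark.sql import functions as F

# Pre-aggregate daily metrics from Silver into a read-optimized Gold table
daily = (spark.table("silver.trips_clean")
         .groupBy("trip_date")
         .agg(F.count("*").alias("total_trips"),
              F.avg("fare_amount").alias("avg_fare"),
              F.sum("total_amount").alias("total_revenue"))
         .withColumn("trip_month", F.date_format("trip_date", "yyyy-MM")))

(daily.write.mode("overwrite")
      .format("delta")
      .partitionBy("trip_month")   # Gold: partition by month (see the layer table above)
      .saveAsTable("gold.nyc_taxi_daily_summary"))

# Compact and co-locate files on the most common filter column
spark.sql("OPTIMIZE gold.nyc_taxi_daily_summary ZORDER BY (trip_date)")
```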
When setting up medallion architecture end-to-end, the LLM must not stop after creating notebooks and deploying code. The complete lifecycle is:

Create Resources → Deploy Content → Bind Lakehouses → Execute → Verify Results
- Build the notebook content as a valid `.ipynb` structure (see notebook-api-operations.md)
- Set `metadata.dependencies.lakehouse` in the `.ipynb` payload with:
  - `default_lakehouse`: the target lakehouse GUID
  - `default_lakehouse_name`: the lakehouse display name
  - `default_lakehouse_workspace_id`: the workspace GUID
- Call `updateDefinition` with the Base64-encoded `.ipynb` payload (content + lakehouse binding together)
- Trust that the `updateDefinition` LRO returned Succeeded; that is sufficient. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds significant latency per notebook.
- Execute via `POST .../jobs/instances?jobType=RunNotebook` (sketch below):
  - Include `defaultLakehouse` with both `id` and `name` in `executionData.configuration`
  - Poll the Bronze job instance until it is Completed → run Silver → poll → run Gold → poll
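A minimal run-and-poll sketch, assuming `$workspace_id`, `$notebook_id`, and `$lakehouse_id` were captured earlier; the job instance ID to poll comes from the Location header of the 202 response:

```bash
# Run the Bronze notebook with an explicit default lakehouse (both id and name)
cat > /tmp/run.json << EOF
{
  "executionData": {
    "configuration": {
      "defaultLakehouse": { "id": "$lakehouse_id", "name": "sales_bronze" }
    }
  }
}
EOF

az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items/$notebook_id/jobs/instances?jobType=RunNotebook" \
  --body @/tmp/run.json

# Poll until status is Completed (job instance ID comes from the Location header above)
az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items/$notebook_id/jobs/instances/<job-instance-id>" \
  --query "status" --output tsv
```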
If the flow stops after deploying notebook code without binding or executing:

- The default lakehouse binding is missing, so `spark.sql()` and relative paths (`Tables/`, `Files/`) fail at runtime

After Gold tables are populated, connect Power BI to surface the analytics. Build a semantic model on top of the Gold lakehouse using Direct Lake.
- Call `GET /v1/workspaces/{workspaceId}/lakehouses/{goldLakehouseId}` and extract `properties.sqlEndpointProperties.connectionString` and `provisioningStatus`; wait until the status is Success
- Connect with `sqlcmd` (see COMMON-CLI.md § SQL / TDS Data-Plane Access) and confirm the target table exists:
```sql
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'
```
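A minimal connection sketch, assuming `$sql_endpoint` holds the connection string extracted above and that your `sqlcmd` build supports `-G` for Microsoft Entra authentication:

```bash
# Verify the Gold table is visible through the SQL endpoint
sqlcmd -S "$sql_endpoint" -d "<gold-lakehouse-name>" -G \
  -Q "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'"
```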
Create the semantic model via `POST /v1/workspaces/{workspaceId}/items` with `type: "SemanticModel"`, then deploy its definition via `updateDefinition` using TMDL format (see ITEM-DEFINITIONS-CORE.md § SemanticModel):
- Table(s) from the Gold lakehouse in Direct Lake mode (e.g., `nyc_taxi_daily_summary`)
- Measures (e.g., `Total Trips`, `Avg Fare`, `Total Revenue`, `Month over Month Growth`)

Create the report via `POST /v1/workspaces/{workspaceId}/items` with `type: "Report"`, then deploy its definition via `updateDefinition` using PBIR format (see ITEM-DEFINITIONS-CORE.md § Report):
- A `definition.pbir` that points the report at the semantic model
- Use the `powerbi-consumption-cli` skill to run DAX queries against the semantic model and confirm data flows from Gold tables through to the report
- Read `properties.sqlEndpointProperties.connectionString` from the lakehouse response; never hardcode it
- Wait for provisioning status Success before connecting; newly created lakehouses may take minutes to provision

When a user requests a pipeline for the medallion flow, guide LLM to design with sequential notebook activities (Bronze → Silver → Gold), run-date parameters, retry logic, and a schedule trigger (a definition sketch appears at the end of this document).
For detailed Spark configurations and optimization strategies, see data-engineering-patterns.md.
| Layer | Profile | Key Settings |
|---|---|---|
| Bronze | Write-heavy | Disable V-Order, enable autoCompact, large file targets, partition by ingestion_date |
| Silver | Balanced | Enable V-Order, adaptive query execution, partition by business date, ZORDER on filtered columns |
| Gold | Read-heavy | V-Order (spark.sql.parquet.vorder.default=true), Optimize Write (optimizeWrite.enabled=true, binSize=1g), vectorized readers, adaptive execution, ZORDER on all filter columns, pre-aggregate metrics |
Prompt: "Set up medallion architecture with separate Bronze, Silver, and Gold workspaces for sales analytics"
What the LLM should generate: REST API calls to:
- Create workspaces `sales-bronze-dev`, `sales-silver-dev`, `sales-gold-dev`
- Create lakehouses `sales_bronze`, `sales_silver`, `sales_gold`

```bash
# Workspace creation (see COMMON-CLI.md for full patterns)
cat > /tmp/body.json << 'EOF'
{"displayName": "sales-analytics-dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
--url "https://api.fabric.microsoft.com/v1/workspaces" \
--body @/tmp/body.json --query "id" --output tsv)
# Create Bronze lakehouse
cat > /tmp/body.json << 'EOF'
{"displayName": "sales_bronze", "type": "Lakehouse"}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
--url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json
```
Prompt: "Ingest daily CSV files into bronze lakehouse with metadata columns"
What the LLM should generate: PySpark notebook that:
- Reads daily CSVs from the lakehouse `Files/landing/daily/` folder
- Adds `ingestion_timestamp`, `source_file`, and `batch_id` columns
- Appends to a partitioned Bronze Delta table

```python
# Bronze ingestion pattern (guide LLM to generate full implementation)
from pyspark.sql.functions import current_timestamp, input_file_name, lit, to_date
import uuid

batch_id = str(uuid.uuid4())

df = (spark.read.format("csv").option("header", True).load("Files/landing/daily/")
      .withColumn("ingestion_timestamp", current_timestamp())
      .withColumn("source_file", input_file_name())
      .withColumn("batch_id", lit(batch_id))
      # The partition column must exist before partitionBy
      .withColumn("ingestion_date", to_date(current_timestamp())))

df.write.mode("append").partitionBy("ingestion_date").format("delta").saveAsTable("bronze.events_raw")
```
Prompt: "Clean bronze data: remove duplicates, filter invalid records, add derived columns, write to silver"
What the LLM should generate: PySpark notebook applying quality rules, schema conformance, and partitioned write with optimization.
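A minimal sketch of that pattern, assuming hypothetical business columns (`event_id`, `event_date`, `amount`) on the Bronze table from the previous example:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze = spark.table("bronze.events_raw")

silver = (bronze
          # Quality rules: drop rows missing the business key or with invalid amounts
          .filter(F.col("event_id").isNotNull() & (F.col("amount") >= 0))
          # Deduplicate on the business key, keeping the most recently ingested row
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("event_id").orderBy(F.col("ingestion_timestamp").desc())))
          .filter(F.col("rn") == 1).drop("rn")
          # Derived / conformed columns
          .withColumn("event_date", F.to_date("event_date"))
          .withColumn("amount_rounded", F.round("amount", 2)))

(silver.write.mode("overwrite")
       .format("delta")
       .option("mergeSchema", "true")   # tolerate additive source schema changes
       .partitionBy("event_date")       # Silver: partition by business date
       .saveAsTable("silver.events_clean"))
```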
Prompt: "Create a pipeline that runs bronze ingestion, then silver transformation, then gold aggregation daily at 2 AM"
What the LLM should generate: Pipeline JSON definition with sequential notebook activities, date parameter, retry logic, and schedule trigger.
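A minimal sketch of what that definition body could look like, assuming the ADF-style activity schema that Fabric data pipelines use; the activity type name (`TridentNotebook`), retry policy fields, and how the 2 AM schedule is attached (typically as an item schedule rather than inside this JSON) should be verified against ITEM-DEFINITIONS-CORE.md. Notebook and workspace IDs are placeholders:

```json
{
  "properties": {
    "parameters": { "runDate": { "type": "String" } },
    "activities": [
      {
        "name": "BronzeIngestion",
        "type": "TridentNotebook",
        "policy": { "retry": 2, "retryIntervalInSeconds": 120 },
        "typeProperties": { "notebookId": "<bronze-notebook-id>", "workspaceId": "<workspace-id>" }
      },
      {
        "name": "SilverTransform",
        "type": "TridentNotebook",
        "dependsOn": [ { "activity": "BronzeIngestion", "dependencyConditions": [ "Succeeded" ] } ],
        "policy": { "retry": 2, "retryIntervalInSeconds": 120 },
        "typeProperties": { "notebookId": "<silver-notebook-id>", "workspaceId": "<workspace-id>" }
      },
      {
        "name": "GoldAggregation",
        "type": "TridentNotebook",
        "dependsOn": [ { "activity": "SilverTransform", "dependencyConditions": [ "Succeeded" ] } ],
        "policy": { "retry": 2, "retryIntervalInSeconds": 120 },
        "typeProperties": { "notebookId": "<gold-notebook-id>", "workspaceId": "<workspace-id>" }
      }
    ]
  }
}
```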