From the data-agent-kit-starter-pack plugin.
Develops and executes PySpark code on GCP Dataproc clusters or Dataproc Serverless for ETL pipelines and ML tasks. Reads from and writes to BigLake Iceberg, BigQuery, and Spanner. Debugs job failures.
Install the plugin with:

```
npx claudepluginhub gemini-cli-extensions/data-agent-kit-starter-pack --plugin data-agent-kit-starter-pack
```

This skill uses the workspace's default tool permissions.
> [!IMPORTANT]
> You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.
Task Execution Workflow:

1. Use the @skill:discovering-gcp-data-assets skill or resources/schema_direct_inspection.md to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
2. Refer to resources/read_write_data.md when reading or writing data (see the sketch after this list).
3. Use the @skill:ml-best-practices skill and resources/ml_tasks.md when generating ML code.
4. Refer to resources/spark_optimizations.md when generating Spark code and apply optimizations whenever applicable.
5. Use `df.printSchema()` to check the dataframe schema, and refer to the @skill:discovering-gcp-data-assets skill or resources/schema_direct_inspection.md to verify the destination schema.
6. When generating a notebook, validate it by running `jupyter nbconvert --to script your-notebook.ipynb` first, then compile the code using `python3 -m py_compile your-notebook.py`.
7. When generating a .py script, refer to resources/gcloud_dataproc.md for how to write the command that executes the generated code on Dataproc. This DOES NOT apply when generating notebooks.
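To illustrate steps 2 and 5, here is a minimal PySpark sketch that reads a BigQuery table, checks schemas with `printSchema()`, and writes the result back. The project, dataset, tables, column names, and temporary bucket are placeholders, and it assumes the spark-bigquery connector is available on the Dataproc cluster or batch; resources/read_write_data.md remains the authoritative reference.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read an input table through the spark-bigquery connector
# (assumes the connector jar is available on the cluster/batch).
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales.orders")  # placeholder table
    .load()
)

# Verify the input schema before generating any transformation code.
orders.printSchema()

completed = orders.filter(col("status") == "COMPLETE")

# Verify the schema again against the destination table before writing.
completed.printSchema()

# Indirect BigQuery writes stage data in a temporary GCS bucket.
(
    completed.write.format("bigquery")
    .option("table", "my-project.sales.orders_complete")  # placeholder table
    .option("temporaryGcsBucket", "my-temp-bucket")       # placeholder bucket
    .mode("overwrite")
    .save()
)
```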
> [!CAUTION]
> Ensure you verify this checklist to avoid mistakes.

Before submitting a job, verify:
- Import SQL functions (`col`, `when`, `lit`, etc.) from `pyspark.sql.functions`.
- Import `vector_to_array` from the correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`).
- Check `df.printSchema()` before writing.
- Set `header` and `inferSchema` when reading CSV files; without these, the header row becomes data and all columns are strings (see the sketch below).
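A minimal sketch of the checklist items above, with placeholder paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit   # SQL functions live here
from pyspark.ml.functions import vector_to_array   # NOT in pyspark.sql.functions
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("checklist-sketch").getOrCreate()

# Read CSV with header and inferSchema; without them the header row
# becomes data and every column is typed as a string.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("gs://my-bucket/input/events.csv")  # placeholder path
)

# Inspect the schema before writing any output.
df.printSchema()

df = df.withColumn("is_large", when(col("amount") > 100, lit(True)).otherwise(lit(False)))

# vector_to_array converts an ML vector column into a plain array column.
features = VectorAssembler(inputCols=["amount"], outputCol="features").transform(df)
features = features.withColumn("features_arr", vector_to_array("features"))
features.printSchema()
```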
The Dataproc service account needs:

- `roles/dataproc.worker`: Job execution
- `roles/biglake.admin`: Iceberg table management
- `roles/bigquery.jobUser`: Query materialization
- `roles/storage.objectUser`: Read/write GCS
- `roles/spanner.databaseUser`: Spanner writes

Refer to resources/gcloud_dataproc.md for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.