From the data-agent-kit-starter-pack plugin.
Develops and executes PySpark code on GCP Dataproc clusters or Dataproc Serverless for ETL pipelines and ML tasks. Reads from and writes to BigLake Iceberg, BigQuery, and Spanner. Debugs job failures.
Install the plugin with:

```
npx claudepluginhub gemini-cli-extensions/data-agent-kit-starter-pack --plugin data-agent-kit-starter-pack
```

This skill uses the workspace's default tool permissions.
> [!IMPORTANT]
> You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.
Task Execution Workflow:

1. Use the @skill:discovering-gcp-data-assets skill or resources/schema_direct_inspection.md to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
2. Refer to resources/read_write_data.md when reading or writing data (see the sketch after this list).
3. Use the @skill:ml-best-practices skill and resources/ml_tasks.md when generating ML code.
4. Refer to resources/spark_optimizations.md when generating Spark code and apply optimizations whenever applicable.
5. Use `df.printSchema()` to check the dataframe schema, and refer to the @skill:discovering-gcp-data-assets skill or resources/schema_direct_inspection.md to verify the destination schema.
6. When generating a notebook, validate it by running `jupyter nbconvert --to script your-notebook.ipynb` first, then compile the code using `python3 -m py_compile your-notebook.py`.
7. When generating a .py script, refer to resources/gcloud_dataproc.md for how to write the command that executes the generated code on Dataproc. This DOES NOT apply when generating notebooks.
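To illustrate steps 2 and 5, here is a minimal PySpark sketch that reads a BigQuery table, checks schemas with `printSchema()`, and writes the result back. The project, dataset, tables, column names, and temporary bucket are placeholders, and it assumes the spark-bigquery connector is available on the Dataproc cluster or batch; resources/read_write_data.md remains the authoritative reference.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read an input table through the spark-bigquery connector
# (assumes the connector jar is available on the cluster/batch).
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales.orders")  # placeholder table
    .load()
)

# Verify the input schema before generating any transformation code.
orders.printSchema()

completed = orders.filter(col("status") == "COMPLETE")

# Verify the schema again against the destination table before writing.
completed.printSchema()

# Indirect BigQuery writes stage data in a temporary GCS bucket.
(
    completed.write.format("bigquery")
    .option("table", "my-project.sales.orders_complete")  # placeholder table
    .option("temporaryGcsBucket", "my-temp-bucket")       # placeholder bucket
    .mode("overwrite")
    .save()
)
```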
> [!CAUTION]
> Ensure you verify this checklist to avoid mistakes.

Before submitting a job, verify:
- Import SQL functions (`col`, `when`, `lit`, etc.) from `pyspark.sql.functions`.
- Import `vector_to_array` from the correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`).
- Check `df.printSchema()` before writing.
- Set `header` and `inferSchema` when reading CSV files; without these, the header row becomes data and all columns are strings (see the sketch below).
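A minimal sketch of the checklist items above, with placeholder paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit   # SQL functions live here
from pyspark.ml.functions import vector_to_array   # NOT in pyspark.sql.functions
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("checklist-sketch").getOrCreate()

# Read CSV with header and inferSchema; without them the header row
# becomes data and every column is typed as a string.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("gs://my-bucket/input/events.csv")  # placeholder path
)

# Inspect the schema before writing any output.
df.printSchema()

df = df.withColumn("is_large", when(col("amount") > 100, lit(True)).otherwise(lit(False)))

# vector_to_array converts an ML vector column into a plain array column.
features = VectorAssembler(inputCols=["amount"], outputCol="features").transform(df)
features = features.withColumn("features_arr", vector_to_array("features"))
features.printSchema()
```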
The Dataproc service account needs:

- `roles/dataproc.worker`: Job execution
- `roles/biglake.admin`: Iceberg table management
- `roles/bigquery.jobUser`: Query materialization
- `roles/storage.objectUser`: Read/write GCS
- `roles/spanner.databaseUser`: Spanner writes

Refer to resources/gcloud_dataproc.md for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.