From data-agent-kit-starter-pack
Generates or updates Google Cloud Composer orchestration pipelines for data pipelines like dbt, notebooks, Spark jobs, Dataform, Python scripts, and BigQuery SQL. Deploys and triggers pipelines.
npx claudepluginhub gemini-cli-extensions/data-agent-kit-starter-pack --plugin data-agent-kit-starter-pack
This skill uses the workspace's default tool permissions.
If relevant, call the associated reference file(s) before you take actions. Refer to the table below to determine which reference file to retrieve in different scenarios involving specific functions. [!IMPORTANT]: DO NOT GUESS filenames. You MUST only use the exact paths provided below.
| Function/Use Case | Required Reference File | Capabilities & Intent Keywords |
|---|---|---|
| orchestration-pipelines schema | references/orchestration-pipelines-schema.md | orchestrate, generate, create, update |
Orchestration pipelines require creating two files to ensure a complete and deployable pipeline:
1. `Orchestration File` (e.g., `orchestration-pipeline.yaml`,
`test-pipeline.yaml`): Defines the pipeline's logic, tasks, and
schedule. **IMPORTANT:** Check if a `deployment.yaml` file exists and
references an existing orchestration file. If it does, you **must update
the existing orchestration file** (e.g., `test_pipeline.yaml`) instead of
creating a new one. The filename can be customized but must be
referenced in the `deployment.yaml` file.
2. `deployment.yaml`: Defines the environment-specific
configurations (e.g., `dev`, `prod`). `deployment.yaml` should exist only
in the repository root and must be named `deployment.yaml`.
All files must be maintained together and placed in the root of the workspace folder.
This skill helps create or update the configuration files that orchestrate data pipelines.
Examine the repository's root directory for a deployment.yaml file.
Check for existing setup: The absence of deployment.yaml indicates
that orchestration has not been set up.
Determine if initialization is required: If deployment.yaml is missing,
initialization is required and you MUST run the init command in Step 3
to scaffold the project. Do NOT create the files manually.
Pipeline Name: If initialization is needed, ask the user for the pipeline name. If the user hasn't provided an orchestration pipeline name, default to "orchestration_pipeline".
Environment Name: If initialization is needed, you MUST ask the user for the environment name. If the user does not provide it, use dev as the default.
Execute Initialization: Once you have the pipeline name, run the following command:
# Replace <ORCHESTRATION_PIPELINE_NAME> with the actual name
# Replace <ENV_NAME> with the actual environment name
gcloud beta orchestration-pipelines init <ORCHESTRATION_PIPELINE_NAME> --environment=<ENV_NAME>
Pipeline Models (mapping to YAML)
[!IMPORTANT] While the internal pipeline models are defined using protobuf (which typically uses
snake_case), the YAML configuration expects camelCase for almost all field names. Mapping Rule: Always convert
snake_case proto fields (e.g., pipeline_id) to camelCase in YAML (e.g., pipelineId).
Refer to the file references/orchestration-pipelines-schema.md.
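For instance, a hypothetical fragment illustrating the naming rule only (the field shown is for illustration; the actual fields and nesting are defined in the schema reference above):
# proto field (snake_case)   ->   YAML key (camelCase)
# pipeline_id                ->   pipelineId
# start_time                 ->   startTime
pipelineId: example_pipeline   # illustrative value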
Required Tags (Top-Level): You MUST add a tags field to the top-level
of the orchestration pipeline YAML definition. The value of this field depends
on the IDE environment:
["job:datacloud:antigravity"].["job:datacloud:vscode"].["job:datacloud:other"].Top-Level Structure: The root of the YAML should be an object with the following fields:
Top-Level Structure: The root of deployment.yaml should be an object with the following fields (a sketch follows the tip below):
- environments (dictionary): A map where keys are environment names (e.g., 'dev', 'prod') and values are Environment objects.
Environment: Each environment object contains the following fields:
- project (string): The Google Cloud Project ID.
- region (string): The Google Cloud region (e.g., 'us-central1').
- composer_environment (string): The Cloud Composer environment name.
- artifact_storage:
  - bucket (string): The GCS bucket.
  - path_prefix (string): The prefix of the path to use within the bucket.
- pipelines:
  - source (string): Orchestration pipeline YAML file name. Multiple entries are allowed.
- variables (dictionary, optional): Key-value pairs representing environment variables. Values can be strings, numbers, or booleans.
[!TIP] If the user doesn't provide specific paths for scripts, dbt projects, or GCP details (Project ID, Region), use tools like
find_by_name to search the repository and gcloud commands (e.g., gcloud config get-value project) to retrieve the necessary information.
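A minimal sketch of the deployment.yaml structure described above (every value is an illustrative assumption):
environments:
  dev:
    project: example-gcp-project
    region: us-central1
    composer_environment: example-composer-env
    artifact_storage:
      bucket: example-artifact-bucket
      path_prefix: "team-"
    pipelines:
      - source: 'orchestration_pipeline.yaml'
    variables:
      DBT_TARGET: dev        # optional key-value pairs
      RETRY_COUNT: 3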
Before generating, check if an orchestration pipeline definition file and
deployment.yaml already exist in the current directory. If they do, inform
the user and ask if they want to update the existing files or create new
ones with different names. Do not overwrite without confirmation.
First, before creating the orchestration pipeline definition file, you must run the following command to get the list of available Dataproc clusters for the user's project. This avoids using placeholder values to run the jobs.
# Replace <PROJECT_ID> with the actual project_id
# Replace <REGION> with the actual region
gcloud dataproc clusters list \
--project <PROJECT_ID> \
--region <REGION>
[!TIP] Running the command without
--format=yaml provides a clear, tabular output that is easier to read.
Then use the returned dataproc list with details to create the orchestration
pipeline definition file based on the user's requirements for the pipeline's
logic and schedule. IMPORTANT: Every schedule must include an
endTime. Every schedule must use the current date as startTime if
the user hasn't specified one.
[!IMPORTANT] A Composer environment is not a Dataproc cluster. If no Dataproc clusters are available, do not use a Composer environment for the
sparkHistoryServerConfig. It is better to omit this configuration if a dedicated Spark History Server is not available.
If you want to schedule a Python job, check the content of the Python
file to determine whether it is a Spark job. If it is, use pyspark as the type
instead of script.
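A minimal sketch of the distinction (the task fields other than type are assumptions; consult the schema reference for the exact shape):
# Plain Python script (illustrative)
- name: run_cleanup
  type: script
  source: scripts/cleanup.py
# Python file that is actually a Spark job (illustrative)
- name: run_transform
  type: pyspark
  source: scripts/transform_spark.py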
Before creating or updating the deployment.yaml file, you must first
run the following command to get the list of available Composer environments
for the user's project.
# Replace <PROJECT_ID> with the actual project_id
# Replace <REGION> with the actual region
gcloud composer environments list \
--project <PROJECT_ID> \
--locations <REGION>
After listing the available Composer environments, you must check each environment to ensure it uses a compatible image version or has the right PyPI packages installed. Run the following command for each environment:
# Replace <ENVIRONMENT_NAME> with the Composer environment name
# Replace <REGION> with the region
gcloud composer environments describe <ENVIRONMENT_NAME> \
--location <REGION> \
--format="json(config.softwareConfig.imageVersion, config.softwareConfig.pypiPackages)"
From the output, select an environment whose imageVersion value is one of "composer-3-airflow-3.1.7-build.x, composer-3-airflow-2.11.1-build.x, composer-3-airflow-2.10.5-build.x, composer-3-airflow-2.9.3-build.x, composer-2.16.11-airflow-2.11.1, composer-2.16.11-airflow-2.10.5, composer-2.16.11-airflow-2.9.3", or select an environment where the orchestration-pipelines package is listed in the PyPI packages.
This ensures the selected environment is compatible with orchestration pipelines.
Third, before generating the deployment.yaml file, you must ask the
user to provide the artifact_storage bucket name. Note that the
artifact_storage bucket is typically initialized as a placeholder (e.g.,
YOUR_BUCKET) by the init command in Step 1. You must identify any such
placeholders, ask the user for the actual bucket name, and then update the
deployment.yaml file with the provided value.
Use the returned composer list with details, along with the project ID,
region, and the bucket name provided by the user, to generate or update the
deployment.yaml file. When generating or updating the deployment.yaml
file, you must replace placeholders (e.g., "<YOUR_PROJECT_ID>",
"<YOUR_REGION>", "<YOUR_COMPOSER>", "<YOUR_BUCKET>") with the actual
retrieved and provided values. Additionally, you must remove any
associated # TODO: comments once the placeholders are replaced.
Ensure both files adhere to the code structures and syntax specified in this document.
Renaming Pipelines: If requested to change the orchestration pipeline
name, you must rename the orchestration YAML file accordingly (e.g., from
dbt_clean_pipeline.yaml to new_name.yaml) and update the source field
within the pipelines list in deployment.yaml to match the new filename.
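For example, if dbt_clean_pipeline.yaml is renamed to new_name.yaml, the pipelines entry in deployment.yaml changes accordingly (fragment shown for illustration):
pipelines:
  - source: 'new_name.yaml'   # was 'dbt_clean_pipeline.yaml' before the rename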
[!IMPORTANT]
Time Format: Do NOT include the Z suffix in startTime and endTime. Use the format "YYYY-MM-DDTHH:MM:SS" (e.g., "2025-10-01T00:00:00").
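A minimal illustrative schedule fragment (the cron field and surrounding structure are assumptions; startTime and endTime follow the rules above):
schedule:
  cron: "0 6 * * *"                  # hypothetical cron expression
  startTime: "2025-10-01T00:00:00"   # no trailing Z
  endTime: "2026-10-01T00:00:00"     # endTime is mandatory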
After creating or editing pipeline files, you MUST validate them using the
gcloud beta orchestration-pipelines validate command. You must: a. Read the
deployment.yaml file to identify all defined environments. b. Run the
validate command below for each environment found in deployment.yaml.
# Replace <ENV_NAME> with the identified environment name
gcloud beta orchestration-pipelines validate --environment=<ENV_NAME>
Check the output of the validation command.
If the command returns an error or failure message, fix the reported issue in the pipeline files, then re-run the validation command to confirm the fix. Do not mark the task as complete until the validation passes (exit code 0), and do not fall back to creating an Airflow DAG in Python if validation fails.
When asked to generate or verify declarative pipeline files, ensure they follow these compliant structures. Do not use the exact values below; adapt them to the user's specific project, region, and environment details.
deployment.yaml Template - IMPORTANT: FORMAT MUST MATCH
environments:
<environment_name>: # e.g., dev, prod
project: <PROJECT_ID>
region: <REGION>
composer_environment: <COMPOSER_ENVIRONMENT_NAME>
gcs_bucket: "" # Optional
artifact_storage:
bucket: <ARTIFACT_BUCKET_NAME>
path_prefix: "<prefix>-" # e.g., namespace or username prefix
pipelines:
- source: '<orchestration-pipeline.yaml>' # e.g., list of pipeline yaml names
If requested to deploy the orchestration pipeline:
You MUST ask the user which environment to deploy to. If no environment name
is provided, list the available environments from deployment.yaml and ask
the user to choose one, defaulting to dev if it exists.
Read the orchestration YAML to extract the pipelineId.
Deploy with --local. This uploads the DAG without running it:
# Replace <ENV_NAME> with the target environment
# Replace <PIPELINE_SOURCE> with the orchestration YAML filename
gcloud beta orchestration-pipelines deploy \
--environment=<ENV_NAME> --local
Parse the deploy output to extract the bundle ID (version). The output
includes a line like: "Pipeline deployment successful for version local-b32d15e307b5". The version string (e.g., local-b32d15e307b5) is the
bundle ID.
[!IMPORTANT]
--local deployments now default to --paused=true. The deployed DAG will be visible in Airflow as a paused DAG without a schedule. It will not auto-run. Use Step 7 to trigger it.
If requested to trigger/run the orchestration pipeline, you MUST follow the Deploy → Poll → Trigger flow.
Ask for environment: You MUST ask the user which environment to use.
Default to dev if it exists in deployment.yaml.
Deploy first (Step 6): Always deploy before triggering to ensure the run
uses the latest code. Extract the bundle ID from deploy output and the
pipelineId from the orchestration YAML.
Poll for DAG readiness: Wait for the DAG to be registered in Composer.
# Initial delay: wait 30 seconds after deploy
sleep 30
# Poll every 15 seconds, up to 2 minutes total
# Replace <ENV_NAME>, <BUNDLE_ID> with actual values
gcloud beta orchestration-pipelines list \
--environment=<ENV_NAME> \
--bundle=<BUNDLE_ID>
The pipeline is ready when it appears in the list output. If it does not appear after 2 minutes, report failure and advise the user to check YAML validity.
Trigger the pipeline:
# Replace <ENV_NAME>, <BUNDLE_ID>, <PIPELINE_ID> with actual values
gcloud beta orchestration-pipelines trigger \
--environment=<ENV_NAME> \
--bundle=<BUNDLE_ID> \
--pipeline=<PIPELINE_ID>
Check the run status:
# Replace <ENV_NAME>, <BUNDLE_ID>, <PIPELINE_ID> with actual values
gcloud beta orchestration-pipelines runs list \
--environment=<ENV_NAME> \
--bundle=<BUNDLE_ID> \
--pipeline=<PIPELINE_ID>
[!TIP] Trigger-only (no deploy): If the user wants to trigger an already-deployed pipeline, skip Step 6. Use
gcloud beta orchestration-pipelines list --environment=<ENV_NAME> to find the bundle ID, then trigger directly with Step 7.4.
[!IMPORTANT] Fallback: If gcloud trigger fails, use the bundled script. Run the script with --help to discover and learn the interface.
python scripts/trigger/airflow_trigger.py \
--project <PROJECT_ID> \
--location <REGION> \
--environment <COMPOSER_ENV> \
--dag_id <PIPELINE_ID>
Get project, region, and composer_environment from deployment.yaml.
Success criteria:
- The deployment.yaml file is created successfully.
- The orchestration pipeline file (e.g., orchestration_pipeline.yaml) is created successfully, includes a mandatory endTime for every schedule, and passes the validation command: gcloud beta orchestration-pipelines validate --environment=<ENV_NAME>
- The gcloud beta orchestration-pipelines deploy --environment=<ENV_NAME> --local command returns a success message with a version/bundle ID.
- The deployed pipeline appears in gcloud beta orchestration-pipelines list within 2 minutes.
- gcloud beta orchestration-pipelines trigger returns success.
- The run appears in gcloud beta orchestration-pipelines runs list.
If requested to pause/stop the orchestration pipeline, use:
# Replace <ENV_NAME>, <BUNDLE_ID>, <PIPELINE_ID> with actual values
gcloud beta orchestration-pipelines pause \
--environment=<ENV_NAME> \
--bundle=<BUNDLE_ID> \
--pipeline=<PIPELINE_ID>
If requested to unpause/resume the orchestration pipeline, use
# Replace <ENV_NAME>, <BUNDLE_ID>, <PIPELINE_ID> with actual values
gcloud beta orchestration-pipelines unpause \
--environment=<ENV_NAME> \
--bundle=<BUNDLE_ID> \
--pipeline=<PIPELINE_ID>