From bauplan
Creates bauplan data pipeline projects with SQL and Python models for DAG transformations, new pipelines, model writing, and project setup from scratch.
```shell
npx claudepluginhub bauplanlabs/bauplan-skills --plugin bauplan
```
This skill guides you through creating a new bauplan data pipeline project from scratch, including the project configuration and transformation models.
NEVER run pipelines on the `main` branch. ALWAYS use a separate data branch for development.
Branch naming convention: `<username>.<branch_name>` (e.g., `john.feature-pipeline`). Get your username by running `bauplan info`.
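The convention can be captured by a small helper (purely illustrative; `dev_branch_name` is not part of the bauplan SDK):

```python
import re

def dev_branch_name(username: str, feature: str) -> str:
    """Build a <username>.<branch_name> development branch name (hypothetical helper)."""
    # lowercase the feature label and collapse anything non-alphanumeric into hyphens
    slug = re.sub(r"[^a-z0-9-]+", "-", feature.lower()).strip("-")
    return f"{username}.{slug}"

print(dev_branch_name("john", "Feature Pipeline"))  # john.feature-pipeline
```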
Before writing pipeline code, check whether the project uses uv (look for `pyproject.toml` or `uv.lock`). If so, use `uv run` to execute commands and `uv add` to install packages. Otherwise, use `pip install`.
Ensure bauplan is installed — it provides both the SDK and the `bauplan` CLI. Verify with `bauplan info`. Models declare their own runtime dependencies via `@bauplan.python('3.11', pip={...})`, but the local environment needs bauplan to run `bauplan run`.
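A minimal sketch of that environment check (the helper name and command prefixing are illustrative):

```python
from pathlib import Path

def uses_uv(project_dir: str = ".") -> bool:
    """Detect a uv-managed project by its marker files."""
    root = Path(project_dir)
    return (root / "pyproject.toml").exists() or (root / "uv.lock").exists()

# prefix commands with `uv run` when the project is uv-managed
runner = ["uv", "run"] if uses_uv() else []
print(runner + ["bauplan", "info"])
```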
Tables are referenced as `<namespace>.<table_name>` (e.g., `bauplan.taxi_fhvhv`); the default namespace is `bauplan`. Before writing a pipeline, you MUST gather the following from the user:

- Source table(s): inspect each with `bauplan table get <namespace>.<table_name>`.
- Materialization strategy (e.g., `REPLACE`): should output tables use REPLACE or APPEND?

If any required item is missing, ask the user before writing any code.
A bauplan pipeline is a DAG of functions (models). Key concepts:
- Models read their inputs via `bauplan.Model()` references, either from the outputs of previous models or from Source Tables.
- The function name becomes the output table name (`def clean_trips()` → `clean_trips`).
- Data quality expectations are handled by the `bauplan-data-quality-checks` skill — do not write expectations directly in this skill.

An example DAG:

```
[lakehouse: taxi_fhvhv] ──→ [trips] ──→ [clean_trips] ──→ [daily_summary]
                                             ↑
[lakehouse: taxi_zones] ─────────────────────┘
```
- `taxi_fhvhv` and `taxi_zones` are Source Tables (already in the lakehouse).
- `trips` is a Python model reading from `taxi_fhvhv` (single input).
- `clean_trips` is a Python model taking `trips` and `taxi_zones` as inputs (multiple inputs).
- `daily_summary` is a Python model taking `clean_trips` as input (single input).

Python models are the preferred way to write all transformations. They are Python functions registered with decorators.
```python
# import bauplan globally, but DO NOT import other packages at the top level
import bauplan

@bauplan.model(
    # declare expected output columns for validation
    columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
    # persist output as an Iceberg table; omit for intermediate models
    materialization_strategy='REPLACE'
)
# specify Python version and dependencies; prefer Polars or DuckDB over Pandas
@bauplan.python('3.11', pip={'polars': '1.15.0'})
def clean_trips(
    # use columns and filter for efficient I/O pushdown
    data=bauplan.Model(
        'trips',
        columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
        filter="trip_miles > 0"
    )
):
    """
    Filters trips to include only those with positive mileage.

    | pickup_datetime     | PULocationID | trip_miles |
    |---------------------|--------------|------------|
    | 2022-12-01 08:00:00 | 123          | 5.2        |
    """
    # import dependencies inside the function — each model runs in its own environment
    import polars as pl

    df = pl.from_arrow(data)
    df = df.filter(pl.col('trip_miles') > 0.0)
    return df.to_arrow()
```
Models can take multiple tables as input — add more `bauplan.Model()` parameters.
See examples.md for complete examples.
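The dependency structure of the example DAG can be sketched with the standard library's `graphlib` (illustrative only — bauplan resolves the DAG itself from `bauplan.Model()` references):

```python
from graphlib import TopologicalSorter

# model -> its inputs; source tables appear only as dependencies
dag = {
    "trips": ["taxi_fhvhv"],
    "clean_trips": ["trips", "taxi_zones"],
    "daily_summary": ["clean_trips"],
}
# a valid execution order: every model runs after all of its inputs
order = list(TopologicalSorter(dag).static_order())
print(order)
```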
Whenever possible, specify `columns` in `@bauplan.model()` to define the expected output schema. This enables automatic validation of your model's output. Check the schema of your source tables first, then declare output columns based on what your transformation actually produces.
Every Python model should have a docstring describing the transformation and showing the output table structure as an ASCII table. If the table is too wide, show only key columns; if values are too large, truncate them.
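A docstring table like the one in the example above could be produced with a small helper (hypothetical, for illustration only):

```python
def ascii_table(headers, row):
    """Render a one-row ASCII table suitable for a model docstring."""
    # column width = widest of header and value
    widths = [max(len(str(h)), len(str(v))) for h, v in zip(headers, row)]
    fmt = lambda cells: "| " + " | ".join(str(c).ljust(w) for c, w in zip(cells, widths)) + " |"
    sep = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(headers), sep, fmt(row)])

print(ascii_table(["pickup_datetime", "trip_miles"], ["2022-12-01 08:00:00", 5.2]))
```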
**`columns` and `filter`:** Use `columns` and `filter` in `bauplan.Model()` to restrict the data read at the storage level. Do not read columns you don't need. This enables I/O pushdown, meaning Bauplan filters data at the Iceberg layer before it reaches your function. On large tables, this can reduce data transfer by orders of magnitude.

- `columns`: list only the columns your model actually needs.
- `filter`: SQL-like expression to restrict rows (e.g., `filter="price > 0"`).

See examples.md for a complete guide.
Use Polars or DuckDB for data processing inside Python models. Do not use Pandas. Bauplan models receive and return Apache Arrow tables. Polars and DuckDB operate natively on Arrow with zero-copy reads and multi-threaded execution. Pandas requires a full data copy into its own format — slower, single-threaded, and uses significantly more memory.
- Polars: `pl.from_arrow(data)` on input, `result.to_arrow()` on output.
- DuckDB: register the Arrow table with `con.register("name", data)`.

Note: `client.query()` in the SDK returns a PyArrow table directly — no `.to_arrow()` needed. Inside models, the `data` parameter is also already an Arrow table. The only place you call `.to_arrow()` is when converting a Polars DataFrame back to Arrow for the model return value.
A bauplan project is a folder containing:
```
my-project/
├── bauplan_project.yml   # Required: project configuration
├── models.py             # Python models (one file can contain the entire pipeline)
└── expectations.py       # Optional: generated by the bauplan-data-quality-checks skill
```
**`bauplan_project.yml`:** Every project requires this configuration file:

```yaml
project:
  id: <unique-uuid>     # Generate with: python3 -c "import uuid; print(uuid.uuid4())"
  name: <project_name>  # Descriptive name for the project
```
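A quick way to scaffold the file (a sketch; the project name `taxi_pipeline` is a placeholder):

```python
import uuid
from pathlib import Path

# write a minimal bauplan_project.yml with a fresh UUID into the current directory
config = f"project:\n  id: {uuid.uuid4()}\n  name: taxi_pipeline\n"
Path("bauplan_project.yml").write_text(config)
print(config)
```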
Always run from inside the project directory (the folder containing bauplan_project.yml).
A dry run validates the pipeline without materializing any tables. It checks that the DAG is valid, source tables exist, SQL parses correctly, and declared output columns match. Always dry-run before a full run.
```shell
bauplan run --dry-run
```
If the dry run fails, read the error output, fix the issue, and dry-run again. Don't proceed to a full run until the dry run passes.
```shell
bauplan run
```
This executes the pipeline and materializes all models that have a materialization_strategy set. After a successful run, verify the output:
```shell
bauplan table get <namespace>.<output_table>
bauplan query "SELECT * FROM <namespace>.<output_table> LIMIT 5"
```
Append `--strict` to catch declaration errors early. In strict mode, the run fails immediately on output column mismatches or expectation failures instead of logging a warning and continuing. Recommended during development.

```shell
bauplan run --dry-run --strict
bauplan run --strict
```
After writing models, verify each model has the correct `materialization_strategy`:

- Every model that should persist a table declares it in `@bauplan.model()`: `@bauplan.model(materialization_strategy='REPLACE')` or `'APPEND'`.

End-to-end workflow:

1. `bauplan info` — confirm the environment and get your username.
2. `bauplan branch checkout main`
3. `bauplan branch create <username>.<branch_name>`
4. `bauplan branch checkout <username>.<branch_name>`
5. `bauplan table get <namespace>.<table_name>` — inspect source schemas.
6. Create `bauplan_project.yml` and write the models.
7. `bauplan run --dry-run`
8. `bauplan run`
9. Verify with `bauplan table get` + `bauplan query`.
10. Invoke the `bauplan-data-quality-checks` skill with the project path. The skill reads models.py, derives checks from how the pipeline uses each table, and generates expectations.py.

Bauplan also supports SQL models for simple reads from source tables. SQL models are `.sql` files where the filename becomes the output table name and the FROM clause defines the inputs.
```sql
-- trips.sql
SELECT
    pickup_datetime,
    PULocationID,
    trip_miles
FROM taxi_fhvhv
WHERE pickup_datetime >= '2022-12-01'
```
Output table: `trips` (from the filename). Add `-- bauplan: materialization_strategy=REPLACE` as a comment to materialize.
Limitations:
See examples.md for:
When unsure about a method signature, CLI flag, or concept, fetch the relevant doc page via WebFetch rather than guessing. Pages are markdown and LLM-friendly.
Python SDK: https://docs.bauplanlabs.com/reference/bauplan.md
Standard expectations: https://docs.bauplanlabs.com/reference/bauplan-standard-expectations.md
Relevant concept pages:
- https://docs.bauplanlabs.com/concepts/models.md
- https://docs.bauplanlabs.com/concepts/pipelines.md
- https://docs.bauplanlabs.com/concepts/projects.md
- https://docs.bauplanlabs.com/concepts/expectations.md
- https://docs.bauplanlabs.com/overview/execution-model.md

Full doc index: https://docs.bauplanlabs.com/llms.txt
CLI: The bauplan CLI is self-documenting:
- `bauplan --help` — lists all available commands
- `bauplan <command> --help` — shows arguments and options for a specific command (e.g., `bauplan run --help`, `bauplan table --help`)

**Validating generated Python:** After writing or updating models.py or expectations.py, run `ruff check` and `ruff format` to catch syntax errors and style issues, and `ty` to catch type errors — these verify the code compiles and the SDK calls are well-formed without executing it. Only run these if they are installed (check with `which ruff` / `which ty`). This is a fast sanity check before `bauplan run --dry-run`.
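That check loop could look like the following sketch (`validate` is a hypothetical helper; each tool is skipped when not installed):

```python
import shutil
import subprocess

def validate(path: str = "models.py") -> None:
    """Run ruff and ty on a file, skipping any tool that is not on PATH."""
    checks = (
        ["ruff", "check", path],
        ["ruff", "format", "--check", path],
        ["ty", "check", path],
    )
    for cmd in checks:
        if shutil.which(cmd[0]) is None:
            print(f"skipping {cmd[0]} (not installed)")
            continue
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```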