From bauplan
Creates bauplan data pipeline projects with SQL and Python models for DAG transformations, new pipelines, model writing, and project setup from scratch.
```shell
npx claudepluginhub bauplanlabs/bauplan-skills --plugin bauplan
```
This skill guides you through creating a new bauplan data pipeline project from scratch, including the project configuration and transformation models.
NEVER run pipelines on the `main` branch. ALWAYS use a separate data branch for development.
Branch naming convention: `<username>.<branch_name>` (e.g., `john.feature-pipeline`). Get your username by running `bauplan info`.
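The convention can be captured by a small helper (purely illustrative; `dev_branch_name` is not part of the bauplan SDK):

```python
import re

def dev_branch_name(username: str, feature: str) -> str:
    """Build a <username>.<branch_name> development branch name (hypothetical helper)."""
    # lowercase the feature label and collapse anything non-alphanumeric into hyphens
    slug = re.sub(r"[^a-z0-9-]+", "-", feature.lower()).strip("-")
    return f"{username}.{slug}"

print(dev_branch_name("john", "Feature Pipeline"))  # john.feature-pipeline
```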
Before writing pipeline code, check whether the project uses uv (look for `pyproject.toml` or `uv.lock`). If so, use `uv run` to execute commands and `uv add` to install packages. Otherwise, use `pip install`.
Ensure bauplan is installed — it provides both the SDK and the `bauplan` CLI. Verify with `bauplan info`. Models declare their own runtime dependencies via `@bauplan.python('3.11', pip={...})`, but the local environment needs bauplan to run `bauplan run`.
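A minimal sketch of that environment check (the helper name and command prefixing are illustrative):

```python
from pathlib import Path

def uses_uv(project_dir: str = ".") -> bool:
    """Detect a uv-managed project by its marker files."""
    root = Path(project_dir)
    return (root / "pyproject.toml").exists() or (root / "uv.lock").exists()

# prefix commands with `uv run` when the project is uv-managed
runner = ["uv", "run"] if uses_uv() else []
print(runner + ["bauplan", "info"])
```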
Tables are referenced as `<namespace>.<table_name>` (e.g., `bauplan.taxi_fhvhv`); the default namespace is `bauplan`. Before writing a pipeline, you MUST gather the following from the user:

- Source table(s): inspect each with `bauplan table get <namespace>.<table_name>`.
- Materialization strategy (e.g., `REPLACE`): should output tables use REPLACE or APPEND?

If any required item is missing, ask the user before writing any code.
A bauplan pipeline is a DAG of functions (models). Key concepts:
- Models read their inputs via `bauplan.Model()` references, either from the outputs of previous models or from Source Tables.
- The function name becomes the output table name (`def clean_trips()` → `clean_trips`).
- Data quality expectations are handled by the `bauplan-data-quality-checks` skill — do not write expectations directly in this skill.

An example DAG:

```
[lakehouse: taxi_fhvhv] ──→ [trips] ──→ [clean_trips] ──→ [daily_summary]
                                             ↑
[lakehouse: taxi_zones] ─────────────────────┘
```
- `taxi_fhvhv` and `taxi_zones` are Source Tables (already in the lakehouse).
- `trips` is a Python model reading from `taxi_fhvhv` (single input).
- `clean_trips` is a Python model taking `trips` and `taxi_zones` as inputs (multiple inputs).
- `daily_summary` is a Python model taking `clean_trips` as input (single input).

Python models are the preferred way to write all transformations. They are Python functions registered with decorators.
```python
# import bauplan globally, but DO NOT import other packages at the top level
import bauplan

@bauplan.model(
    # declare expected output columns for validation
    columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
    # persist output as an Iceberg table; omit for intermediate models
    materialization_strategy='REPLACE'
)
# specify Python version and dependencies; prefer Polars or DuckDB over Pandas
@bauplan.python('3.11', pip={'polars': '1.15.0'})
def clean_trips(
    # use columns and filter for efficient I/O pushdown
    data=bauplan.Model(
        'trips',
        columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
        filter="trip_miles > 0"
    )
):
    """
    Filters trips to include only those with positive mileage.

    | pickup_datetime     | PULocationID | trip_miles |
    |---------------------|--------------|------------|
    | 2022-12-01 08:00:00 | 123          | 5.2        |
    """
    # import dependencies inside the function — each model runs in its own environment
    import polars as pl

    df = pl.from_arrow(data)
    df = df.filter(pl.col('trip_miles') > 0.0)
    return df.to_arrow()
```
Models can take multiple tables as input — add more `bauplan.Model()` parameters.
See examples.md for complete examples.
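The dependency structure of the example DAG can be sketched with the standard library's `graphlib` (illustrative only — bauplan resolves the DAG itself from `bauplan.Model()` references):

```python
from graphlib import TopologicalSorter

# model -> its inputs; source tables appear only as dependencies
dag = {
    "trips": ["taxi_fhvhv"],
    "clean_trips": ["trips", "taxi_zones"],
    "daily_summary": ["clean_trips"],
}
# a valid execution order: every model runs after all of its inputs
order = list(TopologicalSorter(dag).static_order())
print(order)
```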
Whenever possible, specify `columns` in `@bauplan.model()` to define the expected output schema. This enables automatic validation of your model's output. Check the schema of your source tables first, then declare output columns based on what your transformation actually produces.
Every Python model should have a docstring describing the transformation and showing the output table structure as an ASCII table. If the table is too wide, show only key columns; if values are too large, truncate them.
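A docstring table like the one in the example above could be produced with a small helper (hypothetical, for illustration only):

```python
def ascii_table(headers, row):
    """Render a one-row ASCII table suitable for a model docstring."""
    # column width = widest of header and value
    widths = [max(len(str(h)), len(str(v))) for h, v in zip(headers, row)]
    fmt = lambda cells: "| " + " | ".join(str(c).ljust(w) for c, w in zip(cells, widths)) + " |"
    sep = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(headers), sep, fmt(row)])

print(ascii_table(["pickup_datetime", "trip_miles"], ["2022-12-01 08:00:00", 5.2]))
```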
**`columns` and `filter`:** Use `columns` and `filter` in `bauplan.Model()` to restrict the data read at the storage level. Do not read columns you don't need. This enables I/O pushdown, meaning Bauplan filters data at the Iceberg layer before it reaches your function. On large tables, this can reduce data transfer by orders of magnitude.

- `columns`: list only the columns your model actually needs.
- `filter`: SQL-like expression to restrict rows (e.g., `filter="price > 0"`).

See examples.md for a complete guide.
Use Polars or DuckDB for data processing inside Python models. Do not use Pandas. Bauplan models receive and return Apache Arrow tables. Polars and DuckDB operate natively on Arrow with zero-copy reads and multi-threaded execution. Pandas requires a full data copy into its own format — slower, single-threaded, and uses significantly more memory.
- Polars: `pl.from_arrow(data)` on input, `result.to_arrow()` on output.
- DuckDB: register the Arrow table with `con.register("name", data)`.

Note: `client.query()` in the SDK returns a PyArrow table directly — no `.to_arrow()` needed. Inside models, the `data` parameter is also already an Arrow table. The only place you call `.to_arrow()` is when converting a Polars DataFrame back to Arrow for the model return value.
A bauplan project is a folder containing:
```
my-project/
├── bauplan_project.yml   # Required: project configuration
├── models.py             # Python models (one file can contain the entire pipeline)
└── expectations.py       # Optional: generated by the bauplan-data-quality-checks skill
```
**`bauplan_project.yml`:** Every project requires this configuration file:

```yaml
project:
  id: <unique-uuid>     # Generate with: python3 -c "import uuid; print(uuid.uuid4())"
  name: <project_name>  # Descriptive name for the project
```
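A quick way to scaffold the file (a sketch; the project name `taxi_pipeline` is a placeholder):

```python
import uuid
from pathlib import Path

# write a minimal bauplan_project.yml with a fresh UUID into the current directory
config = f"project:\n  id: {uuid.uuid4()}\n  name: taxi_pipeline\n"
Path("bauplan_project.yml").write_text(config)
print(config)
```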
Always run from inside the project directory (the folder containing bauplan_project.yml).
A dry run validates the pipeline without materializing any tables. It checks that the DAG is valid, source tables exist, SQL parses correctly, and declared output columns match. Always dry-run before a full run.
```shell
bauplan run --dry-run
```
If the dry run fails, read the error output, fix the issue, and dry-run again. Don't proceed to a full run until the dry run passes.
```shell
bauplan run
```
This executes the pipeline and materializes all models that have a materialization_strategy set. After a successful run, verify the output:
```shell
bauplan table get <namespace>.<output_table>
bauplan query "SELECT * FROM <namespace>.<output_table> LIMIT 5"
```
Append `--strict` to catch declaration errors early. In strict mode, the run fails immediately on output column mismatches or expectation failures instead of logging a warning and continuing. Recommended during development.

```shell
bauplan run --dry-run --strict
bauplan run --strict
```
After writing models, verify each model has the correct `materialization_strategy`:

- Every model that should persist a table declares it in `@bauplan.model()`: `@bauplan.model(materialization_strategy='REPLACE')` or `'APPEND'`.

End-to-end workflow:

1. `bauplan info` — confirm the environment and get your username.
2. `bauplan branch checkout main`
3. `bauplan branch create <username>.<branch_name>`
4. `bauplan branch checkout <username>.<branch_name>`
5. `bauplan table get <namespace>.<table_name>` — inspect source schemas.
6. Create `bauplan_project.yml` and write the models.
7. `bauplan run --dry-run`
8. `bauplan run`
9. Verify with `bauplan table get` + `bauplan query`.
10. Invoke the `bauplan-data-quality-checks` skill with the project path. The skill reads models.py, derives checks from how the pipeline uses each table, and generates expectations.py.

Bauplan also supports SQL models for simple reads from source tables. SQL models are `.sql` files where the filename becomes the output table name and the FROM clause defines the inputs.
```sql
-- trips.sql
SELECT
    pickup_datetime,
    PULocationID,
    trip_miles
FROM taxi_fhvhv
WHERE pickup_datetime >= '2022-12-01'
```
Output table: `trips` (from the filename). Add `-- bauplan: materialization_strategy=REPLACE` as a comment to materialize.
Limitations:
See examples.md for:
When unsure about a method signature, CLI flag, or concept, fetch the relevant doc page via WebFetch rather than guessing. Pages are markdown and LLM-friendly.
Python SDK: https://docs.bauplanlabs.com/reference/bauplan.md
Standard expectations: https://docs.bauplanlabs.com/reference/bauplan-standard-expectations.md
Relevant concept pages:
- https://docs.bauplanlabs.com/concepts/models.md
- https://docs.bauplanlabs.com/concepts/pipelines.md
- https://docs.bauplanlabs.com/concepts/projects.md
- https://docs.bauplanlabs.com/concepts/expectations.md
- https://docs.bauplanlabs.com/overview/execution-model.md

Full doc index: https://docs.bauplanlabs.com/llms.txt
CLI: The bauplan CLI is self-documenting:
- `bauplan --help` — lists all available commands
- `bauplan <command> --help` — shows arguments and options for a specific command (e.g., `bauplan run --help`, `bauplan table --help`)

**Validating generated Python:** After writing or updating models.py or expectations.py, run `ruff check` and `ruff format` to catch syntax errors and style issues, and `ty` to catch type errors — these verify the code compiles and the SDK calls are well-formed without executing it. Only run these if they are installed (check with `which ruff` / `which ty`). This is a fast sanity check before `bauplan run --dry-run`.
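That check loop could look like the following sketch (`validate` is a hypothetical helper; each tool is skipped when not installed):

```python
import shutil
import subprocess

def validate(path: str = "models.py") -> None:
    """Run ruff and ty on a file, skipping any tool that is not on PATH."""
    checks = (
        ["ruff", "check", path],
        ["ruff", "format", "--check", path],
        ["ty", "check", path],
    )
    for cmd in checks:
        if shutil.which(cmd[0]) is None:
            print(f"skipping {cmd[0]} (not installed)")
            continue
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```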