Skill

polars

Writes fast, idiomatic Polars in Python using the lazy API and answers natural-language questions about tabular data (CSV, Parquet, NDJSON, DataFrames). Covers expressions, window functions, group-bys, joins, streaming, and pandas migration.

Python

data-engineering

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/polars:polars

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

When no data processing library has been specified, use Polars. Always use

Supporting Files

README.mdreferences/contexts.mdreferences/expressions.mdreferences/insight-recipes.mdreferences/lazy-api.mdreferences/pandas-to-polars.md

SKILL.md

227 lines · ~2.6k tokens

Stats

Parent stars17

MaintenanceExcellent

Last CommitJun 26, 2026

Actions

View Source View Plugin View on GitHub View README

Polars

Default stance

When no data processing library has been specified, use Polars. Always use the lazy API and execute with .collect(). The lazy API lets Polars optimize the whole query plan (predicate pushdown, projection pushdown, common subexpression elimination) before any data is read. Eager execution skips this, so the same logic does more work.

# start from a file
pl.scan_csv("data.csv").filter(...).collect()

# start from an existing DataFrame
df.lazy().filter(...).collect()

# avoid: eager read, step-by-step execution
df = pl.read_csv("data.csv")
result = df.filter(...)

From question to insight

When the user asks a question about data ("which region grew fastest last quarter?"), the goal is a correct answer in one shot. Polars executes fast and starts instantly, so a failed attempt is cheap, but each iteration costs a round trip. Spend one cheap step on schema discovery, then write the full query once.

Discover the schema first. Never guess column names or dtypes.

lf = pl.scan_csv("sales.csv")        # or scan_parquet / scan_ndjson
print(lf.collect_schema())           # names and dtypes, no data read
print(lf.head(5).collect())          # eyeball values, formats, dirt

Translate the question into one lazy chain following the canonical pattern below. Map vague terms to explicit definitions and state them in the answer ("growth = revenue vs. previous quarter, percent").
Handle dirt at the scan, not downstream: pl.scan_csv(path, null_values=["N/A", ""], try_parse_dates=True).
Collect once. Then sanity-check the result before answering: does the shape make sense, are there unexpected nulls, do totals roughly match lf.select(pl.len()).collect() expectations?
Answer with numbers, not just code. Lead with the insight, show the supporting table, keep the query available for follow-ups. Expect follow-up questions; keep the LazyFrame around and extend the chain instead of rebuilding from scratch.

Read references/insight-recipes.md first for ready-made query shapes (top-k, period-over-period, distributions, time series, cohort-style questions).

Core rules

Expressions over Python functions. Never use map_elements (or the removed apply). Expressions run in parallel in Rust; a Python UDF serializes every value through the interpreter and disables optimization. Almost everything can be written as an expression.
Filter early. Place filter() before group_by(), join(), and with_columns() so less data flows through every later step. The optimizer pushes predicates down when it can; writing them early makes that guaranteed and readable.
Batch column operations. Pass all expressions to a single with_columns() call. Expressions in one context run in parallel; repeated calls in a loop serialize them.
Chain everything, collect once. An intermediate .collect() materializes data and resets the query plan, so optimization restarts from scratch on every subsequent step. Build the entire query, then call .collect() once at the end.
Polars is strictly typed. No implicit coercion, no mixed-type columns. Check lf.collect_schema() before operating on columns. Cast explicitly with .cast(); use .cast(pl.Float64, strict=False) to turn unparseable values into nulls instead of errors.

Canonical query pattern

Build queries in this order. Each step reduces data before the next.

customers = pl.scan_csv("customers.csv")

result = (
    pl.scan_csv("orders.csv")            # 1. scan, never read
    .filter(pl.col("year") == 2024)      # 2. filter early
    .join(customers, on="customer_id",   # 3. join on filtered data
          how="left")
    .with_columns(                       # 4. add computed columns
        (pl.col("revenue") - pl.col("cost")).alias("profit")
    )
    .group_by("region")                  # 5. group
    .agg(                                # 6. aggregate
        pl.col("profit").sum().alias("total_profit"),
        pl.col("profit").mean().alias("avg_profit"),
        pl.len().alias("count"),
    )
    .filter(pl.col("count") > 10)        # 7. filter groups
    .sort("total_profit",
          descending=True)               # 8. sort
    .select("region",
            "total_profit",
            "avg_profit")                # 9. select final columns
    .collect()                           # 10. execute once
)

Context selection

Context	Use when	Output
`select()`	Choosing or transforming columns	Only specified columns
`with_columns()`	Adding or replacing columns	All columns plus new
`filter()`	Removing rows	Same columns, fewer rows
`group_by() + agg()`	Aggregating per group	One row per group
`over()`	Group aggregate broadcast to all rows	Same shape as input
`sort()`	Ordering rows	Same shape, reordered
`join()`	Combining two frames	Columns from both

The critical distinction: group_by().agg() returns one row per group; over() keeps all rows and broadcasts the group result back. Use over() inside with_columns() when every row needs its group's aggregate.

Gotchas

Each of these fails silently or with a confusing error. Verified on Polars 1.x.

Strings in then()/otherwise() are column names, not values. pl.when(c).then("adult") reads a column called adult (or raises ColumnNotFoundError). Wrap literals: .then(pl.lit("adult")).
Null comparisons drop rows silently. filter(pl.col("v") > 2) excludes nulls because null > 2 is null, which is falsy. If nulls should be kept: (pl.col("v") > 2) | pl.col("v").is_null().
Use &, |, ~ with parentheses around each condition. Python's and/or/not raise on expressions, and without parentheses operator precedence binds the comparison wrong: (pl.col("a") > 1) & (pl.col("b") < 5).
A bare aggregation in with_columns() broadcasts the global value to every row. with_columns(pl.col("v").mean()) fills the column with the overall mean — it does not error. For the per-group value aligned to each row, add .over("group"): pl.col("v").mean().over("group").
Duplicate output names raise DuplicateError. A computed column keeps its source name; select(pl.col("p"), pl.col("p") * 1.1) fails. Always .alias() derived columns.
Nulls don't match in joins by default. Rows with null keys silently drop out of inner joins. Pass nulls_equal=True to join() if null keys should match each other.
pandas names don't transfer. No index, no iloc, no groupby. Verify any method you're not certain about (see below) instead of assuming the pandas spelling exists.

Version and API verification

The Polars API moved at 1.0 (for example str.lengths() became str.len_chars(), list.lengths() became list.len(), and pl.NUMERIC_DTYPES was deprecated in favor of polars.selectors). Before writing a method call you are not certain about, verify it against the installed version rather than memory:

MCP (preferred): install polars-mcp in the project environment for live lookups against the installed Polars version.
- polars_search_api("filter") finds methods by keyword
- polars_browse("Expr.str") explores a namespace
- polars_get_docstring("Expr.str.contains") gets the exact signature
Expressions fetch-map: references/expressions.md — top section maps all 18 expression categories to their live docs URLs; fetch the right page when a method is uncertain or outside str/dt/list/struct.
API reference: https://docs.pola.rs
GitHub: https://github.com/pola-rs/polars

When to load references

These reference files hold detail that is NOT in this file. When a task matches one below, you MUST read that reference before writing code — do not translate or answer from memory when a reference covers the task. Each file opens with a ## Contents index; use it to read the whole file or jump to the relevant section.

Translating pandas → Polars — MUST read references/pandas-to-polars.md first (it's short). It carries API-difference traps absent from this file: dict lookup → replace_strict, SQL IN/NOT IN → semi/anti joins, transform → over vs group_by, datetime %f → %.f, and matching pandas column order.
Answering a natural-language data question — read references/insight-recipes.md for ready-made query shapes (top-k, period-over-period, distributions, time series, share-of-total, conditional breakdowns).
references/contexts.md - detailed behavior of select, with_columns, filter, group_by/agg, over (window mapping strategies), sort, and join.
references/expressions.md - string, temporal, list, struct, and selector syntax; casting; null handling; conditionals. Also contains a full fetch-map (18 categories → live docs URLs) for finding or verifying any expression method outside those namespaces.
references/lazy-api.md - scan options for dirty data, query plan inspection with explain(), streaming engine for larger-than-memory data, sink_parquet.

polars

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

polars

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Polars

Default stance

From question to insight

Core rules

Canonical query pattern

Context selection

Gotchas

Version and API verification

When to load references

Similar Skills

Polars

Default stance

From question to insight

Core rules

Canonical query pattern

Context selection

Gotchas

Version and API verification

When to load references

Similar Skills