DuckDB patterns for embedded OLAP analytics: in-process SQL on Parquet/CSV/JSON files, window functions, ASOF joins, dbt-duckdb integration, Dagster assets, Python/Node.js APIs, and performance tuning. Zero infrastructure required — analytical power without a server.
Scope: Embedded analytical SQL — local analytics, ELT pipelines, file querying, and data transformation without infrastructure. For transactional database patterns, see postgres-patterns. For orchestrating pipelines on top of DuckDB, see data-engineering.
DuckDB is an embedded, in-process OLAP database — SQLite for analytics. It runs inside your Python, Node.js, or CLI process with no server, no cloud dependency, and no operational overhead.
[Source files]           [DuckDB]              [Output]
Parquet / CSV   ─────►   SQL queries    ────►  DataFrame
JSON / S3       ─────►   window fns     ────►  Parquet
Postgres (FDW)  ─────►   aggregations   ────►  Visualization
When to choose DuckDB over Postgres for analytics: the workload is read-heavy and scan-oriented (aggregations, window functions, joins over Parquet/CSV/JSON), the data lives in files or object storage, and a single process owns the writes. Keep Postgres for transactional, multi-writer workloads (see the anti-patterns table at the end).
# Python
pip install duckdb
# Node.js
npm install duckdb
# CLI
brew install duckdb # macOS
import duckdb
# In-memory (default) — disappears when process ends
conn = duckdb.connect()
# Persisted — survives restarts, sharable
conn = duckdb.connect("analytics.duckdb")
-- Query Parquet directly
SELECT date_trunc('month', event_date) AS month,
count(*) AS events
FROM read_parquet('data/events/*.parquet')
GROUP BY 1
ORDER BY 1;
-- Query CSV
SELECT * FROM read_csv('exports/users.csv', header=true, auto_detect=true);
-- Query JSON
SELECT json_extract(payload, '$.user_id') AS user_id
FROM read_ndjson('logs/events.ndjson');
-- Query S3 (with httpfs extension)
INSTALL httpfs; LOAD httpfs;
SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet');
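-- The sources diagram above also lists Postgres: a minimal sketch using the
-- postgres extension (connection string and table name are placeholders)
INSTALL postgres; LOAD postgres;
ATTACH 'dbname=appdb host=localhost user=readonly' AS pg (TYPE postgres);
SELECT count(*) FROM pg.public.orders;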
-- Rolling 7-day retention
SELECT
user_id,
event_date,
count(*) OVER (
PARTITION BY user_id
ORDER BY event_date
RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
) AS events_last_7d,
-- First event ever (for cohort analysis)
min(event_date) OVER (PARTITION BY user_id) AS first_seen
FROM events;
-- Percentile distribution
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM request_logs;
-- Match each event to the most recent price at or before event_time
SELECT e.user_id, e.event_time, p.price
FROM events e
ASOF JOIN prices p
ON e.symbol = p.symbol AND e.event_time >= p.price_time;
-- Pivot: rows → columns
PIVOT monthly_revenue
ON month
USING sum(revenue)
GROUP BY product;
-- Unpivot: columns → rows
UNPIVOT metrics
ON clicks, impressions, conversions
INTO NAME metric VALUE value;
import duckdb
import pandas as pd
conn = duckdb.connect()
# Query the DataFrame in place: DuckDB's replacement scan resolves the local variable name used in SQL
df = pd.read_parquet("events.parquet")
result = conn.execute("""
SELECT date_trunc('day', ts) AS day, count(*) AS n
FROM df
GROUP BY 1
ORDER BY 1
""").df() # returns a DataFrame
# Write query result to Parquet
conn.execute("""
COPY (SELECT * FROM df WHERE event = 'purchase')
TO 'purchases.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
import duckdb
import polars as pl
conn = duckdb.connect()
lf = pl.scan_parquet("events/*.parquet")
# DuckDB reads Polars LazyFrame directly
result = conn.execute("SELECT * FROM lf WHERE amount > 100").pl()
conn = duckdb.connect()
# Relational API: compose the query lazily with method chaining instead of SQL strings
events = conn.read_parquet("events/*.parquet")
result = (
    events
    .filter("event_type = 'purchase'")
    .aggregate("user_id, sum(amount) AS total", "user_id")
    .order("total DESC")
    .limit(100)
)
result.show()
import duckdb from 'duckdb';
import { promisify } from 'util';
const db = new duckdb.Database(':memory:'); // or 'analytics.duckdb'
const conn = db.connect();
const all = promisify(conn.all.bind(conn));
// Query Parquet from Node.js
const rows = await all(`
SELECT user_id, count(*) AS events
FROM read_parquet('./data/events.parquet')
GROUP BY user_id
ORDER BY events DESC
LIMIT 10
`);
console.log(rows);
db.close();
DuckDB is a strong backend for local dbt development and CI pipelines: models run in-process against local files, with no warehouse credentials or network round-trips required.
analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "analytics.duckdb"
      threads: 4
    ci:
      type: duckdb
      path: ":memory:"   # ephemeral for CI, fastest possible
      threads: 2
pip install dbt-duckdb
dbt debug # verify connection
dbt build # run models + tests
-- models/staging/stg_events.sql
-- dbt-duckdb can reference external files directly
{{ config(materialized='view') }}
SELECT *
FROM read_parquet('{{ env_var("DATA_PATH") }}/events/*.parquet')
-- models/marts/fct_daily_active_users.sql
{{ config(
materialized='incremental',
unique_key='day',
incremental_strategy='delete+insert'
) }}
SELECT
date_trunc('day', event_time) AS day,
count(DISTINCT user_id) AS dau
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
WHERE event_time >= (SELECT max(day) FROM {{ this }})
{% endif %}
GROUP BY 1
# assets/duckdb_assets.py
from dagster import asset, AssetExecutionContext
from dagster_duckdb import DuckDBResource

@asset(group_name="analytics")
def daily_active_users(context: AssetExecutionContext, duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as conn:
        conn.execute("""
            CREATE OR REPLACE TABLE dau AS
            SELECT date_trunc('day', event_time) AS day,
                   count(DISTINCT user_id) AS dau
            FROM read_parquet('/data/events/*.parquet')
            GROUP BY 1
        """)
        count = conn.execute("SELECT count(*) FROM dau").fetchone()[0]
        context.log.info(f"Computed DAU for {count} days")
# resources (definitions.py)
from dagster_duckdb import DuckDBResource
duckdb_resource = DuckDBResource(database="analytics.duckdb")
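# One way to wire it together (sketch): register the asset and resource in a
# Definitions object so Dagster injects the "duckdb" resource parameter above.
from dagster import Definitions
from assets.duckdb_assets import daily_active_users  # module path from the example above

defs = Definitions(
    assets=[daily_active_users],
    resources={"duckdb": duckdb_resource},
)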
import duckdb
conn = duckdb.connect()
# Control how many cores to use (default: all)
conn.execute("SET threads = 8")
# Cap memory usage (default: 80% of RAM)
conn.execute("SET memory_limit = '4GB'")
# Temp directory for spill-to-disk on large joins
conn.execute("SET temp_directory = '/tmp/duckdb'")
# Create a persistent database from Parquet once
conn = duckdb.connect("analytics.duckdb")
conn.execute("""
CREATE TABLE IF NOT EXISTS events AS
SELECT * FROM read_parquet('s3://bucket/events/*.parquet')
""")
# Optional: an ART index helps highly selective point lookups; scans already skip row groups via zone maps
conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON events(user_id)")
conn.execute("""
COPY (
SELECT * FROM events WHERE event_date >= '2026-01-01'
)
TO 'recent_events.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
""")
| Anti-pattern | Problem | Fix |
|---|---|---|
| Importing CSV into DuckDB before querying | Unnecessary step | Use read_csv() directly |
| Using DuckDB for OLTP (many small writes) | Not optimized for row-level updates | Use PostgreSQL for transactional data |
| Single-threaded queries on large files | Slow | DuckDB uses all cores by default — don't force SET threads = 1 |
| Storing DuckDB file in tmpfs/ramdisk | Data loss on restart | Use persistent path for anything not ephemeral |
| Running DuckDB as a server for multiple writers | DuckDB is single-writer | Use Postgres for multi-process write access |