From r-package-skills
Use when code loads or uses duckplyr (library(duckplyr), duckplyr::), processing large datasets with dplyr syntax, working with Parquet files in R, or needing lazy evaluation for bigger-than-memory data
Slash command: /r-package-skills:r-duckplyr
**duckplyr is a drop-in replacement for dplyr powered by DuckDB for speed and memory efficiency.** It uses identical syntax but lazy evaluation - operations execute only when results are needed, enabling processing of datasets larger than available RAM.
Read references/API.md before writing code.
references/API.md - Complete function reference and lazy evaluation patterns

| Use duckplyr when... | Use dplyr when... | Use duckspatial when... | Use data.table when... |
|---|---|---|---|
| Data >100k rows | Small datasets (<100k) | Spatial operations | In-place modification (:=) |
| Larger-than-memory files | All data fits in RAM | Geospatial joins/buffers | Reference semantics |
| Parquet/CSV on disk | Already in memory | DuckDB + spatial queries | Non-equi joins |
| Lazy pipeline optimization | Immediate results | PMTiles, vector tiles | Keyed/rolling joins |

Key insight: duckplyr works on files without loading them into R - it queries Parquet/CSV directly from disk or URLs.

    library(duckplyr)

    # Convert existing data frame
    df <- as_duckdb_tibble(my_data)

    # Or read files directly (lazy evaluation)
    df <- read_parquet_duckdb("large_file.parquet")

    # Standard dplyr syntax
    result <- df |>
      filter(year == 2024) |>
      group_by(category) |>
      summarise(total = sum(value)) |>
      collect() # Materializes result

| Difference | dplyr | duckplyr |
|---|---|---|
| Function name | N/A | as_duckdb_tibble() (not as_duck_frame()) |
| Evaluation | Eager (immediate) | Lazy (until collect()) |
| Sorting | Auto-sorts groups | NO auto-sort - use arrange() |
| NULL handling | na.rm = FALSE default | Excludes NULLs by default |
| Materialization | Always in memory | Controlled by prudence parameter |
"lavish": Converts regardless of size (may OOM)"thrifty": Max 1 million cells (default)"stingy": Never auto-converts (safest for large data)read_parquet_duckdb("file.parquet", prudence = "stingy")
| Task | Function |
|---|---|
| Read Parquet | read_parquet_duckdb(path, prudence = "stingy") |
| Read CSV/JSON | read_csv_duckdb(), read_json_duckdb() |
| Multiple files | read_parquet_duckdb("data_*.parquet") (globs) |
| Convert data frame | as_duckdb_tibble(df) |
| Bring to R | collect() (materializes in R memory) |
| Cache in DuckDB | compute() (temp table) |
| Write file | compute_parquet(), compute_csv() |
| Remote data (HTTP/S3) | db_exec("INSTALL httpfs"), then use URLs |
| Query plan | explain(filter(df, ...)) |
| Memory limit | db_exec("PRAGMA memory_limit = '4GB'") |
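Several rows of the table above can be combined into one round trip. This is a minimal sketch, assuming duckplyr 1.0 or later; the `sales` data, the temporary file path, and the column names are hypothetical. It uses `.by` instead of `group_by()` on the assumption that `.by` is translated to DuckDB more directly, which matters when prudence is "stingy".

```r
library(duckplyr)

# Hypothetical sample data written to a temporary Parquet file.
path <- tempfile(fileext = ".parquet")
sales <- data.frame(
  year = c(2023, 2024, 2024, 2024),
  category = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)
written <- as_duckdb_tibble(sales) |>
  compute_parquet(path)

# Read the file back lazily; rows reach R only at collect().
lazy <- read_parquet_duckdb(path, prudence = "stingy")

totals <- lazy |>
  filter(year == 2024) |>
  summarise(total = sum(value), .by = category) |>
  arrange(category) |> # explicit ordering of the result
  collect()

print(totals)
```

Pointing `path` at an existing Parquet file and dropping the `compute_parquet()` step gives the usual read-only workflow.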

| Mistake | Fix |
|---|---|
| as_duck_frame() | Use as_duckdb_tibble() |
| Early collect() | Keep lazy until end |
| No prudence setting | Set prudence = "stingy" for large files |
| Expecting auto-sort | Use explicit arrange() |
| arrow/readr instead | Use read_*_duckdb() functions |
| Missing httpfs | db_exec("INSTALL httpfs") for URLs |
| No compute() caching | Cache expensive intermediates |
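
Three of the fixes above (keeping the pipeline lazy, caching an intermediate with `compute()`, and sorting explicitly) might look like this in practice. This is a sketch under the assumption that duckplyr 1.0 or later is installed; the `events` data and all variable names are hypothetical.

```r
library(duckplyr)

# Hypothetical event data with one missing amount.
events <- as_duckdb_tibble(data.frame(
  user = c("u1", "u2", "u1", "u3", "u2", "u1"),
  amount = c(5, NA, 7, 1, 4, 2)
))

# Cache an intermediate result that two later queries reuse.
cleaned <- events |>
  filter(!is.na(amount)) |>
  compute()

# Stay lazy until the final collect(); sort explicitly with arrange().
per_user <- cleaned |>
  summarise(total = sum(amount), .by = user) |>
  arrange(desc(total)) |>
  collect()

top_amounts <- cleaned |>
  arrange(desc(amount)) |>
  head(2) |>
  collect()

print(per_user)
print(top_amounts)
```

Whether `compute()` pays off depends on how expensive the cached step is; for a cheap filter like this one it mainly illustrates the pattern.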

See references/API.md for the complete function reference.

npx claudepluginhub arthurgailes/r-package-skills --plugin r-package-skills