From r-package-skills
Use when code loads or uses duckplyr (library(duckplyr), duckplyr::), processing large datasets with dplyr syntax, working with Parquet files in R, or needing lazy evaluation for bigger-than-memory data
npx claudepluginhub arthurgailes/r-package-skills --plugin r-package-skills

This skill uses the workspace's default tool permissions.
**duckplyr is a drop-in replacement for dplyr powered by DuckDB for speed and memory efficiency.** It uses the same syntax but evaluates lazily: operations execute only when results are needed, which makes it possible to process datasets larger than available RAM.
Read references/API.md before writing code.
references/API.md - Complete function reference and lazy evaluation patterns

| Use duckplyr when... | Use dplyr when... | Use duckspatial when... | Use data.table when... |
|---|---|---|---|
| Data >100k rows | Small datasets (<100k) | Spatial operations | In-place modification (:=) |
| Larger-than-memory files | All data fits in RAM | Geospatial joins/buffers | Reference semantics |
| Parquet/CSV on disk | Already in memory | DuckDB + spatial queries | Non-equi joins |
| Lazy pipeline optimization | Immediate results | PMTiles, vector tiles | Keyed/rolling joins |
Key insight: duckplyr works on files without loading them into R. It queries Parquet/CSV files directly from disk or from URLs.
library(duckplyr)

# Convert existing data frame
df <- as_duckdb_tibble(my_data)

# Or read files directly (lazy evaluation)
df <- read_parquet_duckdb("large_file.parquet")

# Standard dplyr syntax
result <- df |>
  filter(year == 2024) |>
  group_by(category) |>
  summarise(total = sum(value)) |>
  collect() # Materializes result
| Difference | dplyr | duckplyr |
|---|---|---|
| Function name | N/A | as_duckdb_tibble() (not as_duck_frame()) |
| Evaluation | Eager (immediate) | Lazy (until collect()) |
| Sorting | Auto-sorts groups | NO auto-sort - use arrange() |
| NULL handling | na.rm = FALSE default | Excludes NULLs by default |
| Materialization | Always in memory | Controlled by prudence parameter |
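
The sorting and missing-value rows above suggest being explicit rather than relying on defaults. A minimal sketch, assuming a hypothetical toy data frame (the `sales` name and its columns are illustrative only): it states `na.rm = TRUE` and adds an `arrange()` so the result does not depend on either backend's default behavior.

```r
library(duckplyr)

# Hypothetical toy data; names and values are for illustration only
sales <- as_duckdb_tibble(data.frame(
  category = c("b", "a", "b", "a"),
  value = c(10, NA, 5, 2)
))

result <- sales |>
  # State the missing-value handling explicitly instead of relying on a default
  summarise(total = sum(value, na.rm = TRUE), .by = category) |>
  # Order the output explicitly instead of relying on group order
  arrange(category) |>
  collect()
```

Written this way, the pipeline should give the same answer whether it runs through dplyr or duckplyr.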
"lavish": Converts regardless of size (may OOM)"thrifty": Max 1 million cells (default)"stingy": Never auto-converts (safest for large data)read_parquet_duckdb("file.parquet", prudence = "stingy")
| Task | Function |
|---|---|
| Read Parquet | read_parquet_duckdb(path, prudence = "stingy") |
| Read CSV/JSON | read_csv_duckdb(), read_json_duckdb() |
| Multiple files | read_parquet_duckdb("data_*.parquet") (globs) |
| Convert data frame | as_duckdb_tibble(df) |
| Bring to R | collect() (materializes in R memory) |
| Cache in DuckDB | compute() (temp table) |
| Write file | compute_parquet(), compute_csv() |
| Remote data (HTTP/S3) | db_exec("INSTALL httpfs"), then use URLs |
| Query plan | explain(df |> filter(...)) |
| Memory limit | db_exec("PRAGMA memory_limit = '4GB'") |
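
Several rows of the table can be combined into one round trip. The sketch below is an illustration under assumptions: the data frame, its columns, and the temporary path are hypothetical. It writes a filtered result to Parquet with `compute_parquet()`, reads it back lazily, prints a query plan, and collects only a small aggregate.

```r
library(duckplyr)

# Hypothetical output location
path <- tempfile(fileext = ".parquet")

# Filter inside DuckDB and write the result straight to a Parquet file
as_duckdb_tibble(data.frame(id = 1:6, grp = rep(c("a", "b"), 3))) |>
  filter(grp == "a") |>
  compute_parquet(path)

# Read the file back lazily; "stingy" prevents accidental materialization
back <- read_parquet_duckdb(path, prudence = "stingy")

# Print the plan DuckDB would run for a further filter
explain(back |> filter(id > 2))

# Only the one-row aggregate is brought into R memory
counts <- back |>
  summarise(n = n()) |>
  collect()
```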
| Mistake | Fix |
|---|---|
| as_duck_frame() | Use as_duckdb_tibble() |
| Early collect() | Keep lazy until end |
| No prudence setting | Set prudence = "stingy" for large files |
| Expecting auto-sort | Use explicit arrange() |
| arrow/readr instead | Use read_*_duckdb() functions |
| Missing httpfs | db_exec("INSTALL httpfs") for URLs |
| No compute() caching | Cache expensive intermediates |
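
The caching advice can be sketched as follows, assuming a hypothetical `events` table (all names and values are illustrative). The aggregate is computed once with `compute()` and then reused by two downstream queries instead of being recomputed for each.

```r
library(duckplyr)

# Hypothetical event data
events <- as_duckdb_tibble(data.frame(
  user = c(1, 1, 2, 3, 3, 3),
  amount = c(5, 7, 1, 2, 2, 9)
))

# Cache the (potentially expensive) aggregate inside DuckDB
per_user <- events |>
  summarise(total = sum(amount), .by = user) |>
  compute()

# Both queries reuse the cached intermediate; collect() happens only at the end
big <- per_user |> filter(total > 10) |> collect()
small <- per_user |> filter(total <= 10) |> collect()
```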
See references/API.md for complete function reference