Benchmarks and optimizes Zarr chunking strategies for multi-dimensional scientific datasets in S3/GCS/local storage, measuring time/memory/I/O for spatial/time/spectral access patterns.
Install:
npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-chunk-optimization
Slash command:
/zarr-chunk-optimization:chunking-strategy
**Benchmark and optimize Zarr chunking strategies** for multi-dimensional datasets stored in cloud object stores (S3, GCS) or local filesystems. This skill helps you determine the optimal chunk configuration for your specific access patterns before committing to a rechunking operation.
Bundled files:
- assets/benchmark-config-example.json
- assets/report-template.md
- references/README.md
- references/access-patterns.md
- references/benchmarking-methodology.md
- references/cloud-storage-patterns.md
- references/memory-constraints.md
- references/nguyen-2023.md
- references/performance-interpretation.md
- scripts/benchmark_runner.py
Research basis: Nguyen et al. (2023), "Impact of Chunk Size on Read Performance of Zarr Data in Cloud-based Object Stores" (DOI: 10.1002/essoar.10511054.2)
When the user invokes this skill, collect the following information:
Dataset location:
- Local filesystem: /data/mydata.zarr
- S3: s3://bucket/path/to/data.zarr
- GCS: gs://bucket/path/to/data.zarr
Dimension names: E.g., ['time', 'frequency', 'baseline']
Current chunk shape: E.g., (1, 2048, 2048) (query with ds.chunks)
Access pattern priorities: Which patterns matter most?
Memory budget: E.g., "8 GB" (typical laptop), "64 GB" (HPC node)
Sample size: How much data to benchmark?
Candidate configurations: User can suggest specific chunk shapes to test
Number of runs: Minimum 5 (default), can increase for high-variance networks
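The chunk shape and memory budget collected above can be related with a little arithmetic. The following is a minimal sketch, not part of the skill's scripts: `parse_memory_budget` and `chunk_nbytes` are hypothetical helper names, the budget is read in decimal units, and `itemsize=4` assumes float32 data.

```python
import math

def parse_memory_budget(text: str) -> int:
    """Convert a budget string such as "8 GB" into bytes (decimal units assumed)."""
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    value, unit = text.split()
    return int(float(value) * units[unit.upper()])

def chunk_nbytes(chunk_shape, itemsize=4):
    """Uncompressed size of one chunk in bytes; itemsize=4 assumes float32."""
    return math.prod(chunk_shape) * itemsize

budget = parse_memory_budget("8 GB")
size = chunk_nbytes((1, 2048, 2048))
print(size)            # 16777216 bytes, i.e. 16 MiB per chunk
print(budget // size)  # 476 uncompressed chunks fit in the budget
```

This only bounds how many uncompressed chunks fit in memory at once; actual peak usage also depends on the computation and on any intermediate copies.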
Sections:
Example recommendation:
## Recommendation
**For mixed workloads (spatial + time-series access):**
Use chunk shape **(50, 512, 512)**:
- Spatial access: 8.3 s ± 0.7 s (23% slower than optimal)
- Time-series access: 11.2 s ± 0.9 s (18% slower than optimal)
- Performance bias: 1.35 (well-balanced)
- Peak memory: 3.8 GB (fits in 8 GB budget)
**For spatial-only workloads:**
Use chunk shape **(10, 1024, 1024)** if memory permits:
- Spatial access: 6.1 s ± 0.5 s (optimal for this pattern)
- Peak memory: 10.2 GB (requires 16 GB system)
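Figures of the form "mean ± spread" and "N% slower" can be derived from the raw per-run timings. The sketch below assumes the spread is the sample standard deviation and that slowdown is measured relative to the fastest configuration's mean; the report template may define these differently. The timings are made-up illustration values, and `summarize` and `slowdown_pct` are hypothetical names.

```python
import statistics

def summarize(times):
    """Return (mean, sample standard deviation) of wall-clock times in seconds."""
    return statistics.mean(times), statistics.stdev(times)

def slowdown_pct(mean_time, best_mean_time):
    """Percent by which mean_time exceeds the fastest configuration's mean."""
    return 100.0 * (mean_time - best_mean_time) / best_mean_time

# Hypothetical timings (seconds) from five cold-cache runs of two configurations
runs_a = [6.0, 6.5, 5.5, 6.2, 5.8]
runs_b = [7.4, 7.6, 7.5, 7.3, 7.7]

mean_a, std_a = summarize(runs_a)
mean_b, std_b = summarize(runs_b)
print(f"A: {mean_a:.1f} s ± {std_a:.1f} s")
print(f"B: {mean_b:.1f} s ± {std_b:.1f} s "
      f"({slowdown_pct(mean_b, mean_a):.0f}% slower than A)")
```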
Saved alongside the report for reproducibility:
{
  "date": "2024-03-08",
  "python_version": "3.11.7",
  "xarray_version": "2024.1.0",
  "zarr_version": "2.17.0",
  "instance_type": "t2.xlarge",
  "storage_backend": "AWS S3 us-east-1"
}
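A record like this can be assembled mostly from the standard library. The sketch below is one possible approach, not the skill's own implementation: `package_version` and `collect_environment` are hypothetical names, and the instance type and storage backend are assumed to come from the user because they cannot be detected reliably.

```python
import json
import platform
from datetime import date
from importlib import metadata

def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def collect_environment(instance_type=None, storage_backend=None):
    """Build the metadata record; both arguments are supplied by the user."""
    return {
        "date": date.today().isoformat(),
        "python_version": platform.python_version(),
        "xarray_version": package_version("xarray"),
        "zarr_version": package_version("zarr"),
        "instance_type": instance_type,
        "storage_backend": storage_backend,
    }

record = collect_environment("t2.xlarge", "AWS S3 us-east-1")
print(json.dumps(record, indent=2))
```

Packages that are not installed are recorded as `null` rather than raising, so the record can still be written on a minimal environment.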
If using Dask, generate dask-report.html showing task graphs and memory usage over time.
The agent should use the helper scripts in scripts/:
- benchmark_runner.py: Core benchmarking loop (5+ runs, timing, memory measurement)
- rechunk.py: Rechunking utilities for generating test configurations
- synthetic_data.py: Generate synthetic Zarr data if no real dataset is available
Critical: Clear caches between every run to measure cold-cache performance.
macOS:
sudo purge
Linux:
sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
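fsspec:
import fsspec
fs = fsspec.filesystem("s3")  # filesystem backing the store; clear between runs
fs.invalidate_cache()  # drop cached directory listings
fs.clear_instance_cache()  # drop cached filesystem instances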
See references/benchmarking-methodology.md for complete methodology.
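The platform-specific commands can be selected programmatically before each run. This is a minimal sketch under the assumption that only macOS and Linux need to be covered; `cache_drop_commands` is a hypothetical helper, and it only returns the commands instead of executing them because they require sudo.

```python
import platform

def cache_drop_commands(system=None):
    """Return the shell commands that drop the OS page cache on this platform."""
    system = system or platform.system()
    if system == "Darwin":
        return [["sudo", "purge"]]
    if system == "Linux":
        return [
            ["sync"],
            ["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"],
        ]
    raise NotImplementedError(f"no cache-drop recipe for {system!r}")

for cmd in cache_drop_commands("Linux"):
    print(" ".join(cmd))
```

Each returned command can be passed to `subprocess.run(cmd, check=True)` immediately before a benchmark run.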
Use tracemalloc (built-in) or memory_profiler (more accurate):
import tracemalloc
tracemalloc.start()
result = ds.sel(time=42).compute()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
Always report peak memory, not mean.
Use time.perf_counter() (not time.time()):
import time
start = time.perf_counter()
result = ds.sel(time=42).compute()
wall_time = time.perf_counter() - start
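The timing and memory snippets can be combined into a repeated-run loop. The following is a sketch of what such a loop might look like, not the contents of benchmark_runner.py: `benchmark` and its `before_each` hook are hypothetical, and the workload is a stand-in for a real read such as `ds.sel(time=42).compute()`.

```python
import statistics
import time
import tracemalloc

def benchmark(fn, n_runs=5, before_each=None):
    """Time fn() n_runs times and record peak traced memory for each run."""
    times, peaks = [], []
    for _ in range(n_runs):
        if before_each is not None:
            before_each()  # e.g. clear caches for a cold-cache measurement
        tracemalloc.start()
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return {
        "mean_s": statistics.mean(times),
        "std_s": statistics.stdev(times) if n_runs > 1 else 0.0,
        "peak_bytes": max(peaks),  # report the peak, not the mean
        "n_runs": n_runs,
    }

# Stand-in workload; replace with the real access pattern under test
stats = benchmark(lambda: sum(range(100_000)))
print(stats)
```

Tracing allocations slows execution, so a stricter setup would measure time and memory in separate passes.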
The references/ folder contains detailed documentation the agent should load on-demand:
Load references only when needed to avoid context bloat. The README.md serves as a table of contents for progressive disclosure.
Designed to work with any multi-dimensional Zarr dataset:
- time × latitude × longitude
- patient × slice × x × y
- x × y × wavelength
The user defines their dimension names and which dimensions are sliced vs. loaded for each access pattern.
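One way to express "sliced vs. loaded" per access pattern is a mapping from pattern name to the dimensions that get a single index, with every other dimension loaded in full. This sketch uses hypothetical names (`build_indexers`, `patterns`) and illustrative dimension sizes.

```python
def build_indexers(dim_sizes, sliced, index=0):
    """Pick one index along each sliced dimension and load the rest in full."""
    unknown = set(sliced) - set(dim_sizes)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    return {
        dim: index if dim in sliced else slice(None)
        for dim in dim_sizes
    }

# Hypothetical dataset with dimensions time × latitude × longitude
dims = {"time": 365, "latitude": 720, "longitude": 1440}
patterns = {
    "spatial": ["time"],                       # one time step, full map
    "time_series": ["latitude", "longitude"],  # one pixel, full time axis
}

for name, sliced in patterns.items():
    print(name, build_indexers(dims, sliced))
```

With xarray, the resulting dictionary can be passed to `ds.isel(indexers)` to perform the corresponding read.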