Benchmarks and optimizes Zarr chunking strategies for multi-dimensional scientific datasets in S3/GCS/local storage, measuring time/memory/I/O for spatial/time/spectral access patterns.
Install:

```shell
npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-chunk-optimization
```

This skill uses the workspace's default tool permissions.
**Benchmark and optimize Zarr chunking strategies** for multi-dimensional datasets stored in cloud object stores (S3, GCS) or local filesystems. This skill helps you determine the optimal chunk configuration for your specific access patterns before committing to a rechunking operation.
Bundled files:

- assets/benchmark-config-example.json
- assets/report-template.md
- references/README.md
- references/access-patterns.md
- references/benchmarking-methodology.md
- references/cloud-storage-patterns.md
- references/memory-constraints.md
- references/nguyen-2023.md
- references/performance-interpretation.md
- scripts/benchmark_runner.py
Research basis: Nguyen et al. (2023), "Impact of Chunk Size on Read Performance of Zarr Data in Cloud-based Object Stores" (DOI: 10.1002/essoar.10511054.2)
When the user invokes this skill, collect the following information:
- **Dataset location**: e.g. `/data/mydata.zarr`, `s3://bucket/path/to/data.zarr`, or `gs://bucket/path/to/data.zarr`
- **Dimension names**: e.g. `['time', 'frequency', 'baseline']`
- **Current chunk shape**: e.g. `(1, 2048, 2048)` (query with `ds.chunks`)
- **Access pattern priorities**: Which patterns matter most?
- **Memory budget**: e.g. "8 GB" (typical laptop), "64 GB" (HPC node)
- **Sample size**: How much data to benchmark?
- **Candidate configurations**: The user can suggest specific chunk shapes to test
- **Number of runs**: Minimum 5 (default); increase for high-variance networks
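The collected answers map naturally onto a config file (compare `assets/benchmark-config-example.json`; the schema below is an illustrative sketch, not necessarily the shipped one):

```python
# Illustrative benchmark configuration (hypothetical schema -- see
# assets/benchmark-config-example.json for the actual shipped example).
config = {
    "dataset": "s3://bucket/path/to/data.zarr",
    "dimensions": ["time", "frequency", "baseline"],
    "current_chunks": [1, 2048, 2048],
    "access_patterns": {"spatial": 0.6, "time_series": 0.4},  # priority weights
    "memory_budget_gb": 8,
    "sample_fraction": 0.05,          # fraction of the dataset to benchmark
    "candidate_chunks": [[10, 1024, 1024], [50, 512, 512]],
    "n_runs": 5,                      # minimum 5; raise on noisy networks
}
```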
The report is organized into sections (see assets/report-template.md). Example recommendation:
## Recommendation
**For mixed workloads (spatial + time-series access):**
Use chunk shape **(50, 512, 512)**:
- Spatial access: 8.3 s ± 0.7 s (23% slower than optimal)
- Time-series access: 11.2 s ± 0.9 s (18% slower than optimal)
- Performance bias: 1.35 (well-balanced)
- Peak memory: 3.8 GB (fits in 8 GB budget)
**For spatial-only workloads:**
Use chunk shape **(10, 1024, 1024)** if memory permits:
- Spatial access: 6.1 s ± 0.5 s (optimal for this pattern)
- Peak memory: 10.2 GB (requires 16 GB system)
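Assuming the performance bias above is the ratio of the slower pattern's mean time to the faster one's (consistent with the numbers in the example, though the report generator may define it differently), it can be computed as:

```python
# Performance bias as the slow/fast ratio of mean access times.
# A value near 1.0 indicates a well-balanced chunk shape.
def performance_bias(times_by_pattern):
    means = list(times_by_pattern.values())
    return max(means) / min(means)

# Mixed-workload example from the recommendation above:
bias = performance_bias({"spatial": 8.3, "time_series": 11.2})
# round(bias, 2) == 1.35
```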
Saved alongside the report for reproducibility:
```json
{
  "date": "2024-03-08",
  "python_version": "3.11.7",
  "xarray_version": "2024.1.0",
  "zarr_version": "2.17.0",
  "instance_type": "t2.xlarge",
  "storage_backend": "AWS S3 us-east-1"
}
```
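Most of that environment block can be captured programmatically with the standard library; `instance_type` and `storage_backend` must still be supplied by the user. A stdlib-only sketch, with library versions looked up defensively in case a package is absent:

```python
import platform

def capture_environment():
    """Collect Python/platform info and installed library versions."""
    from importlib import metadata
    env = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
    for pkg in ("xarray", "zarr", "dask"):
        try:
            env[f"{pkg}_version"] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            env[f"{pkg}_version"] = None  # package not installed
    return env

env = capture_environment()
```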
If using Dask, generate dask-report.html showing task graphs and memory usage over time.
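A sketch of producing that report with `dask.distributed`'s `performance_report` context manager; the import is deferred so the helper only requires Dask when actually called:

```python
def run_with_dask_report(compute_fn, filename="dask-report.html"):
    """Run compute_fn() under a Dask performance report.

    Requires dask.distributed; the resulting HTML shows task graphs
    and worker memory usage over time.
    """
    from dask.distributed import performance_report  # deferred import
    with performance_report(filename=filename):
        return compute_fn()
```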
The agent should use the helper scripts in `scripts/`:

- `benchmark_runner.py`: Core benchmarking loop (5+ runs, timing, memory measurement)
- `rechunk.py`: Rechunking utilities for generating test configurations
- `synthetic_data.py`: Generate synthetic Zarr data if no real dataset is available

**Critical:** Clear caches between every run to measure cold-cache performance.
macOS:

```shell
sudo purge
```

Linux:

```shell
sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
```
fsspec:

```python
import fsspec

# Disable or clear the local cache between runs
fsspec.config.conf['cache_storage'] = None
```
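The per-OS commands above can be wrapped in a small dispatcher. This hypothetical helper (not part of `scripts/`) returns the command list so callers can inspect it before running it with sudo:

```python
import subprocess
import sys

def cache_clear_command(platform=None):
    """Return the cold-cache command for the given (or current) platform."""
    platform = platform or sys.platform
    if platform == "darwin":
        return ["sudo", "purge"]
    if platform.startswith("linux"):
        return ["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"]
    raise NotImplementedError(f"no cache-clear command for {platform}")

def clear_caches():
    """Drop OS page caches; requires sudo privileges."""
    subprocess.run(cache_clear_command(), check=True)
```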
See references/benchmarking-methodology.md for complete methodology.
Use `tracemalloc` (built-in) or `memory_profiler` (tracks process RSS, so it also captures allocations made outside the Python allocator):
```python
import tracemalloc

tracemalloc.start()
result = ds.sel(time=42).compute()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```
Always report peak memory, not mean.
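The pattern above generalizes to a reusable wrapper that returns both the result and the peak byte count:

```python
import tracemalloc

def measure_peak(fn, *args, **kwargs):
    """Run fn and return (result, peak_bytes) as seen by tracemalloc.

    Caveat: tracemalloc only tracks allocations made through the
    Python allocator; native buffers allocated outside it may be
    undercounted.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

result, peak = measure_peak(lambda: [0] * 1_000_000)
```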
Use `time.perf_counter()` (not `time.time()`):
```python
import time

start = time.perf_counter()
result = ds.sel(time=42).compute()
wall_time = time.perf_counter() - start
```
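Putting the pieces together, a minimal benchmark loop reporting mean ± standard deviation over n runs might look like this (cache clearing between runs is omitted here; see above):

```python
import statistics
import time

def benchmark(fn, n_runs=5):
    """Time fn() n_runs times; return (mean, stdev) in seconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

mean_s, std_s = benchmark(lambda: sum(range(100_000)))
```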
The references/ folder contains detailed documentation the agent should load on-demand:
Load references only when needed to avoid context bloat. The README.md serves as a table of contents for progressive disclosure.
Designed to work with any multi-dimensional Zarr dataset, for example:

- time × latitude × longitude
- patient × slice × x × y
- x × y × wavelength

The user defines their dimension names and which dimensions are sliced vs loaded for each access pattern.
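An access-pattern definition might then look like the following (hypothetical schema; the skill's actual format lives in `references/access-patterns.md`):

```python
# Hypothetical access-pattern spec: for each pattern, which dimensions
# are sliced (point/range selection) and which are loaded whole.
access_patterns = {
    "spatial": {"sliced": ["time"], "loaded": ["latitude", "longitude"]},
    "time_series": {"sliced": ["latitude", "longitude"], "loaded": ["time"]},
}
```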