Safely rechunks Zarr datasets with validation, progress reporting, memory limits, rollback safety via rechunk.py CLI. Supports local and S3 cloud storage.
```bash
npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-chunk-optimization
```

This skill uses the workspace's default tool permissions.
**Rechunking is one of the most expensive operations you can perform on a Zarr dataset.** Depending on chunk size and dataset volume, rechunking can take anywhere from approximately 6 minutes to over 46 hours (Nguyen et al., 2023, [DOI: 10.1002/essoar.10511054.2](https://doi.org/10.1002/essoar.10511054.2)). The operation rewrites every byte of data, and an interrupted or misconfigured rechunk can corrupt an entire dataset. This skill wraps the rechunk.py script with a safety-first workflow: validate inputs, estimate costs, rechunk to a new location, verify outputs, and only then swap the result into place.
```bash
# Basic rechunk (local)
python rechunk.py --input /data/source.zarr \
    --output /data/rechunked.zarr \
    --chunks "50,512,512"

# Rechunk with memory limit
python rechunk.py --input /data/source.zarr \
    --output /data/rechunked.zarr \
    --chunks "100,256,256" \
    --max-mem "4GB"

# Rechunk cloud data
python rechunk.py --input s3://bucket/source.zarr \
    --output s3://bucket/rechunked.zarr \
    --chunks "50,512,512" \
    --max-mem "4GB"

# Overwrite existing output
python rechunk.py --input /data/source.zarr \
    --output /data/rechunked.zarr \
    --chunks "200,128,128" \
    --overwrite

# Save summary to specific path
python rechunk.py --input /data/source.zarr \
    --output /data/rechunked.zarr \
    --chunks "50,512,512" \
    --summary results/rechunk_report.json

# Verbose mode for debugging
python rechunk.py --input /data/source.zarr \
    --output /data/rechunked.zarr \
    --chunks "50,512,512" \
    --verbose
```
CLI flags:

| Flag | Required | Default | Description |
|---|---|---|---|
| `--input`, `-i` | Yes | — | Input Zarr store path (local or `s3://`) |
| `--output`, `-o` | Yes | — | Output Zarr store path (local or `s3://`) |
| `--chunks`, `-c` | Yes | — | Target chunk shape, comma-separated (e.g., `"50,512,512"`) |
| `--max-mem` | No | `2GB` | Maximum memory for the rechunker library |
| `--overwrite` | No | false | Overwrite output if it exists |
| `--summary` | No | auto-generated | Path for JSON summary file |
| `--verbose`, `-v` | No | false | Enable debug-level logging |
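The `--chunks` value is a plain comma-separated string. A minimal sketch of how such a string maps to a chunk tuple (the actual parsing inside rechunk.py may differ; `parse_chunks` is a hypothetical helper):

```python
def parse_chunks(spec: str) -> tuple:
    """Parse a comma-separated chunk spec like "50,512,512" into a tuple of ints."""
    chunks = tuple(int(part.strip()) for part in spec.split(","))
    if any(c <= 0 for c in chunks):
        raise ValueError(f"Chunk sizes must be positive: {spec!r}")
    return chunks

print(parse_chunks("50,512,512"))  # (50, 512, 512)
```

The tuple must have one entry per array dimension; a mismatch against the source array's rank is the most common configuration error.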
Use this skill when you need to change the chunk layout of an existing Zarr dataset, whether it lives on local disk or in cloud object storage.
Do not use this skill if you have not first benchmarked the proposed chunk configuration. Rechunking is expensive and irreversible without a backup.
Nguyen et al. (2023) measured rechunking times across a range of chunk sizes for multi-dimensional scientific datasets stored in cloud object stores:
| Chunk Size Category | Approximate Time | Example Shape |
|---|---|---|
| Large chunks | ~6 minutes | (100, 1024, 1024) |
| Medium chunks | ~1–4 hours | (50, 512, 512) |
| Small chunks | ~10–46 hours | (1, 64, 64) |
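The cost difference comes largely from chunk count: smaller chunks mean far more objects to read and write. A quick back-of-the-envelope calculation for a hypothetical float64 dataset of shape (1000, 1024, 1024) illustrates the spread:

```python
import math

shape = (1000, 1024, 1024)   # hypothetical dataset shape, float64
itemsize = 8                  # bytes per float64 element

for label, chunks in [("large",  (100, 1024, 1024)),
                      ("medium", (50, 512, 512)),
                      ("small",  (1, 64, 64))]:
    # Number of chunks along each axis, multiplied together
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    print(f"{label}: {n_chunks} chunks of {chunk_mib:.2f} MiB each")
```

For this shape, the large configuration yields 10 chunks of 800 MiB, while the small one yields 256,000 chunks of about 32 KiB, so per-object request overhead dominates for small chunks on object stores.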
Factors affecting rechunking cost include:

- Chunk size and total chunk count (smaller chunks mean many more objects to write)
- Total dataset volume
- Storage backend (local disk vs. cloud object store)
- Network conditions for cloud-hosted data
The rechunk.py script reports the Nguyen et al. range (6 minutes to 46 hours) as a guideline because precise estimation requires knowing the specific storage backend and network conditions.
The scripts/rechunk.py script performs the following steps automatically:

- Validates inputs and refuses to clobber an existing output (pass `--overwrite` to proceed).
- Uses the `rechunker` library if available (preferred), and falls back to a manual chunk-by-chunk copy via `zarr.copy`.

```bash
# Standard local rechunk
python scripts/rechunk.py \
    --input /data/observations.zarr \
    --output /data/observations_rechunked.zarr \
    --chunks "50,512,512"

# Cloud-to-cloud with increased memory budget
python scripts/rechunk.py \
    --input s3://my-bucket/raw/data.zarr \
    --output s3://my-bucket/optimized/data.zarr \
    --chunks "100,256,256" \
    --max-mem "8GB"
```
Follow this four-step protocol for every rechunking operation:
**Step 1: Benchmark on a sample.** Rechunk a small subset of the data to verify the configuration works before committing to the full dataset.

```bash
# Extract a sample (e.g., first 10 time steps)
python -c "
import zarr
src = zarr.open('/data/full.zarr', 'r')
sample = zarr.open('/tmp/sample.zarr', 'w',
                   shape=(10,) + src.shape[1:],
                   chunks=src.chunks, dtype=src.dtype)
sample[:] = src[:10]
"

# Rechunk the sample
python scripts/rechunk.py \
    --input /tmp/sample.zarr \
    --output /tmp/sample_rechunked.zarr \
    --chunks "5,512,512"
```
**Step 2: Validate the sample.** Inspect the rechunked sample to confirm data integrity:

```python
import zarr
import numpy as np

source = zarr.open('/tmp/sample.zarr', 'r')
result = zarr.open('/tmp/sample_rechunked.zarr', 'r')

assert source.shape == result.shape, "Shape mismatch"
assert source.dtype == result.dtype, "Dtype mismatch"
assert np.array_equal(source[:], result[:]), "Data mismatch"
print("Sample validation passed")
```
**Step 3: Rechunk the full dataset.** Once the sample passes validation, rechunk the full dataset to a new path (never in place):

```bash
python scripts/rechunk.py \
    --input /data/full.zarr \
    --output /data/full_rechunked.zarr \
    --chunks "50,512,512" \
    --max-mem "4GB"
```
**Step 4: Verify and swap.** After the full rechunk completes, verify the output and then swap it into the production path:

```bash
# Verify (the script does basic validation automatically)
# For extra safety, spot-check a few values:
python -c "
import zarr, numpy as np
src = zarr.open('/data/full.zarr', 'r')
dst = zarr.open('/data/full_rechunked.zarr', 'r')
# Check the first, middle, and last slices
for idx in [0, len(src)//2, len(src)-1]:
    assert np.array_equal(src[idx], dst[idx]), f'Mismatch at index {idx}'
print('Spot-check passed')
"

# Swap
mv /data/full.zarr /data/full_backup.zarr
mv /data/full_rechunked.zarr /data/full.zarr
```
The `--max-mem` flag controls how much memory the rechunking process is allowed to use. This is critical for large datasets that do not fit in RAM.

How it works:

- When the `rechunker` library is available, `--max-mem` is passed directly to `rechunker.rechunk()`, which plans an execution graph that respects the memory bound.

Guidelines for setting `--max-mem`:
| System Memory | Recommended --max-mem | Reasoning |
|---|---|---|
| 8 GB | 2GB | Leave headroom for OS and other processes |
| 16 GB | 4GB | Safe default for most workloads |
| 64 GB | 16GB | HPC nodes with dedicated rechunking jobs |
| 128+ GB | 32GB | Large-scale production rechunking |
Warning: Setting --max-mem too high can cause OOM kills. Setting it too low increases the number of read/write cycles and slows down the operation. Start conservative and increase if rechunking is too slow.
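As a sanity check before launching a job, you can confirm that a single target chunk fits comfortably within the budget. This is a hypothetical helper, not part of rechunk.py, and it assumes float64 data:

```python
import math

def chunk_bytes(chunks, itemsize=8):
    """Uncompressed size in bytes of one chunk (itemsize=8 assumes float64)."""
    return math.prod(chunks) * itemsize

max_mem = 2 * 2**30                   # the 2GB default budget, in bytes
size = chunk_bytes((50, 512, 512))    # 104857600 bytes = 100 MiB
assert size < max_mem, "target chunk must fit well under --max-mem"
print(f"{size / 2**20:.0f} MiB per chunk vs {max_mem / 2**20:.0f} MiB budget")
```

If a single chunk approaches the budget, the planner has no room for intermediate buffers; either raise `--max-mem` or choose smaller chunks.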
When rechunking data stored in S3 or GCS, additional considerations apply:
The rechunker library requires an intermediate (temporary) storage location. The script automatically creates a temporary directory for this purpose. For cloud-to-cloud rechunking, the intermediate store is created locally by default. To use cloud intermediate storage:
```python
# In custom scripts, specify cloud intermediate storage
import rechunker
import zarr

source = zarr.open("s3://bucket/input.zarr")  # source array to rechunk
chunks = (50, 512, 512)                       # target chunk shape

plan = rechunker.rechunk(
    source, target_chunks=chunks,
    max_mem="4GB",
    target_store="s3://bucket/output.zarr",
    temp_store="s3://bucket/tmp/rechunk_temp",
)
plan.execute()
```
The script uses `s3fs` for S3 access. Ensure AWS credentials are configured (`~/.aws/credentials` or environment variables).

The script validates the output automatically after the rechunk completes.
For additional validation beyond what the script provides:

```python
import zarr
import numpy as np

src = zarr.open("source.zarr", "r")
dst = zarr.open("output.zarr", "r")

# Dtype check
assert src.dtype == dst.dtype, "Dtype mismatch"

# Metadata check
for key in src.attrs:
    assert key in dst.attrs, f"Missing attribute: {key}"
    assert src.attrs[key] == dst.attrs[key], f"Attribute mismatch: {key}"

# Value sampling (spot-check random positions)
rng = np.random.default_rng(42)
for _ in range(100):
    idx = tuple(rng.integers(0, s) for s in src.shape)
    assert src[idx] == dst[idx], f"Value mismatch at {idx}"
```
The safest approach to rechunking follows a write-verify-swap pattern:

```bash
# Swap pattern
mv /data/dataset.zarr /data/dataset_backup.zarr
mv /data/dataset_rechunked.zarr /data/dataset.zarr

# After validation in production
rm -rf /data/dataset_backup.zarr
```
For cloud storage, use versioned buckets or copy to a backup prefix before swapping.
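For example, with the AWS CLI the backup-then-promote sequence might look like this (bucket and prefix names here are placeholders):

```shell
# Back up the current store to a separate prefix before swapping
aws s3 sync s3://my-bucket/dataset.zarr s3://my-bucket/backup/dataset.zarr

# After the rechunked output passes validation, promote it into place
aws s3 sync s3://my-bucket/dataset_rechunked.zarr s3://my-bucket/dataset.zarr --delete
```

The `--delete` flag removes stale chunk objects left over from the old layout; without it the promoted store would mix old and new chunk files.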
Best practices:

- Failing to set `--max-mem` appropriately can cause OOM errors that corrupt the output. Start conservative.
- Never pass `--overwrite` on the source path. Write to a separate output and swap after validation.
- Use `--verbose` to watch memory usage during rechunking. If the process is killed by OOM, reduce `--max-mem`.
- Use `--summary` to keep a record of every rechunking operation for reproducibility.
- Install `rechunker` for better memory management and execution planning. The fallback chunk-by-chunk copy works but is slower and less memory-efficient.