Generates synthetic Zarr datasets for chunking benchmarks with configurable shapes, dimensions, dtypes, compression codecs, and data patterns. Supports local, S3, and GCS storage for reproducible tests.
```bash
npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-chunk-optimization
```
Controlled benchmarking requires controlled data. When evaluating chunking strategies, using production datasets introduces variables that obscure results: irregular shapes, missing values, network variability, and access restrictions. Synthetic data eliminates these confounders by giving you full control over dimensions, shapes, data types, compression, and statistical patterns. Tests become reproducible across environments, no credentials or cloud access are needed for initial exploration, and you can systematically vary one parameter at a time to isolate its effect on performance.
```bash
# Generate a 3D dataset with default settings (random pattern, zstd compression)
python synthetic_data.py --output /tmp/test.zarr \
    --shape 1000,2048,2048 --chunks 50,256,256

# Generate climate-like data with named dimensions
python synthetic_data.py --output /tmp/climate.zarr \
    --shape 3650,180,360 --chunks 365,90,180 \
    --dims time,lat,lon --pattern temperature --dtype float32

# Generate radio telescope-like data
python synthetic_data.py --output /tmp/radio.zarr \
    --shape 1000,2048,2048 --chunks 50,256,256 \
    --dims time,frequency,baseline --pattern radio

# Write to S3
python synthetic_data.py --output s3://bucket/synthetic.zarr \
    --shape 500,1024,1024 --chunks 50,256,256 --compression zstd

# Create a sample from an existing Zarr store
python synthetic_data.py --sample-from /data/full.zarr \
    --output /tmp/sample.zarr --target-size 8

# No compression (useful for measuring raw I/O)
python synthetic_data.py --output /tmp/raw.zarr \
    --shape 500,512,512 --chunks 50,128,128 --compression none
```
Designing a useful synthetic dataset means matching the characteristics that matter for chunking performance while keeping generation fast and storage small.
Choose dimensions that mirror your production data structure. The number of dimensions and their relative sizes directly affect how many chunks each access pattern touches.
Typical layouts:

- 3D: `(time, spatial_1, spatial_2)` or `(time, frequency, baseline)`
- 4D: `(time, level, lat, lon)` for atmospheric models

Chunk shape determines the unit of I/O when designing test configurations.
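To make the claim concrete, the sketch below counts how many chunks a given slice intersects. The function name `chunks_touched` is hypothetical (not part of the script); the arithmetic itself is independent of any library:

```python
def chunks_touched(shape, chunks, sel):
    """Count chunks intersected by a hyperslab selection.

    sel is a list of (start, stop) index ranges, one per dimension.
    """
    n = 1
    for size, chunk, (start, stop) in zip(shape, chunks, sel):
        first = start // chunk       # first chunk index touched in this dim
        last = (stop - 1) // chunk   # last chunk index touched in this dim
        n *= last - first + 1
    return n

shape = (1000, 2048, 2048)
chunks = (50, 256, 256)

# One full time slice touches every spatial chunk but only one time chunk
print(chunks_touched(shape, chunks, [(0, 1), (0, 2048), (0, 2048)]))  # 64
# A full time series at one pixel touches every time chunk
print(chunks_touched(shape, chunks, [(0, 1000), (0, 1), (0, 1)]))     # 20
```

The asymmetry (64 vs. 20 chunks for single-element slices along different axes) is exactly what chunking benchmarks measure.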
The dtype determines bytes per element and directly affects chunk size in bytes:
| dtype | Bytes per element | 256x256 chunk size |
|---|---|---|
| float32 | 4 | 256 KB |
| float64 | 8 | 512 KB |
| int16 | 2 | 128 KB |
| complex64 | 8 | 512 KB |
Use float32 as the default unless your production data uses a different type. Using float64 when your real data is float32 will double chunk sizes and skew benchmark results.
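The table values follow directly from element count times `itemsize`; a quick check (the helper name `chunk_nbytes` is illustrative, not part of the script):

```python
import numpy as np

def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed size in bytes of one chunk for a given dtype."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

for dt in ("float32", "float64", "int16", "complex64"):
    kb = chunk_nbytes((256, 256), dt) / 1024
    print(f"{dt:>9}: {kb:.0f} KB per 256x256 chunk")
```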
| Argument | Required | Default | Description |
|---|---|---|---|
| `--output`, `-o` | Yes | — | Output path (local or `s3://...`) |
| `--shape` | Yes* | — | Comma-separated array shape |
| `--chunks` | Yes* | — | Comma-separated chunk shape |
| `--dims` | No | `dim_0,dim_1,...` | Comma-separated dimension names |
| `--dtype` | No | `float32` | Array data type |
| `--compression` | No | `zstd` | Codec: `zstd`, `blosc`, `gzip`, `none` |
| `--compression-level` | No | `3` | Compression level |
| `--pattern` | No | `random` | Data pattern: `random`, `temperature`, `radio`, `constant` |
| `--seed` | No | `42` | Random seed for reproducibility |
| `--overwrite` | No | `false` | Overwrite existing output |
| `--sample-from` | No | — | Path to existing Zarr store to sample from |
| `--target-size` | No | `8.0` | Target sample size in GB |
| `--verbose`, `-v` | No | `false` | Enable verbose logging |
*Required when not using --sample-from.
Small test dataset for quick iteration:
```bash
python synthetic_data.py -o /tmp/small.zarr \
    --shape 100,256,256 --chunks 10,64,64
```
Production-scale test (several GB):
```bash
python synthetic_data.py -o /tmp/large.zarr \
    --shape 5000,2048,2048 --chunks 100,512,512 \
    --dtype float32 --compression zstd
```
Multiple compression comparison (run sequentially):
```bash
for codec in zstd blosc gzip none; do
    python synthetic_data.py -o /tmp/test_${codec}.zarr \
        --shape 1000,1024,1024 --chunks 50,256,256 \
        --compression $codec --overwrite
done
```
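After a loop like the one above, comparing the codecs comes down to measuring on-disk size per store. A small sketch of such a helper (the name `store_size_bytes` is illustrative; the demo uses a throwaway directory standing in for a real store):

```python
import os
import tempfile

def store_size_bytes(path):
    """Total bytes of all files under a directory-backed Zarr store (like `du -sb`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total

# Demo on a temporary directory standing in for /tmp/test_zstd.zarr
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "0.0.0"), "wb") as f:
        f.write(b"\x00" * 1024)  # one fake 1 KiB chunk file
    print(store_size_bytes(d))  # 1024
```

Dividing this on-disk size by the uncompressed array size (`shape` elements times dtype `itemsize`) gives the effective compression ratio per codec.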
Create a manageable sample from a large existing dataset:
```bash
python synthetic_data.py --sample-from s3://bucket/full_data.zarr \
    --output /tmp/sample.zarr --target-size 4
```
Local storage is the default. Provide any valid filesystem path:
```bash
python synthetic_data.py -o /data/benchmarks/test.zarr \
    --shape 500,1024,1024 --chunks 50,256,256
```
Local storage is best for initial development and pipeline validation. Disk I/O characteristics differ significantly from cloud object stores, so local benchmarks should not be used to predict cloud performance.
Provide an s3:// URL. The script uses s3fs and requires AWS credentials configured via environment variables, ~/.aws/credentials, or an IAM role:
```bash
export AWS_PROFILE=my-profile
python synthetic_data.py -o s3://my-bucket/benchmarks/test.zarr \
    --shape 1000,1024,1024 --chunks 50,256,256
```
Ensure the bucket region matches your compute region to minimize latency.
GCS support requires gcsfs to be installed. Provide a gcs:// URL:
```bash
python synthetic_data.py -o gcs://my-bucket/benchmarks/test.zarr \
    --shape 1000,1024,1024 --chunks 50,256,256
```
Note: GCS support is listed in the script interface but requires additional implementation. Verify gcsfs is installed and authentication is configured before use.
The --pattern flag controls the statistical structure of generated data. Matching the pattern to your domain improves the realism of compression benchmarks, since compression ratios depend on data regularity.
Climate and weather datasets typically have strong spatial structure and temporal periodicity:
```bash
python synthetic_data.py -o /tmp/climate.zarr \
    --shape 3650,180,360 --chunks 365,90,180 \
    --dims time,lat,lon --pattern temperature --seed 42
```
The temperature pattern generates spatial gradients with sinusoidal temporal variation, mimicking surface temperature fields.
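The script's actual generator isn't reproduced here, but a field with those properties can be sketched in a few lines of NumPy. Everything below (grid sizes, amplitudes, the 12-step seasonal period) is an illustrative assumption, not the script's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
nt, nlat, nlon = 36, 18, 36  # small stand-in for 3650 x 180 x 360

lat = np.linspace(-90, 90, nlat)
t = np.arange(nt)

# Latitudinal gradient: warm equator, cold poles
base = 30.0 * np.cos(np.deg2rad(lat))[None, :, None]
# Sinusoidal seasonal cycle plus small Gaussian noise
seasonal = 10.0 * np.sin(2 * np.pi * t / 12.0)[:, None, None]
noise = rng.normal(0.0, 1.0, (nt, nlat, nlon))

temperature = (base + seasonal + noise).astype("float32")
print(temperature.shape, temperature.dtype)  # (36, 18, 36) float32
```

Smooth gradients like these compress far better than white noise, which is why pattern choice matters for compression benchmarks.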
Radio interferometer visibilities are complex-valued with noise-dominated statistics:
```bash
python synthetic_data.py -o /tmp/visibilities.zarr \
    --shape 1000,2048,2048 --chunks 50,256,256 \
    --dims time,frequency,baseline --pattern radio --seed 42
```
The radio pattern generates visibility-amplitude-like data from complex Gaussian noise.
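A minimal sketch of that kind of data, assuming (not quoting) the script's approach: take the magnitude of complex Gaussian noise, which yields Rayleigh-distributed amplitudes, a common noise-dominated model for visibilities:

```python
import numpy as np

rng = np.random.default_rng(42)
shape = (10, 64, 64)  # small stand-in for 1000 x 2048 x 2048

# Complex Gaussian noise; |z| gives Rayleigh-distributed amplitudes
z = rng.normal(0.0, 1.0, shape) + 1j * rng.normal(0.0, 1.0, shape)
amplitudes = np.abs(z).astype("float32")

print(amplitudes.shape, amplitudes.min() >= 0)  # (10, 64, 64) True
```

Because the result is essentially noise, it compresses poorly, which is the realistic regime for codec benchmarks on interferometer data.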
For isolating compression codec behavior, use the constant pattern (high redundancy) or random pattern (low compressibility):
```bash
# High compressibility: tests codec best-case
python synthetic_data.py -o /tmp/constant.zarr \
    --shape 500,512,512 --chunks 50,128,128 --pattern constant

# Low compressibility: tests codec worst-case
python synthetic_data.py -o /tmp/random.zarr \
    --shape 500,512,512 --chunks 50,128,128 --pattern random
```
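The best-case/worst-case gap is easy to demonstrate on one chunk's worth of bytes. Here stdlib `zlib` stands in for the script's codecs (the ordering holds for zstd, blosc, and gzip as well):

```python
import zlib
import numpy as np

rng = np.random.default_rng(42)
chunk = (50, 128, 128)  # matches the --chunks above
n = int(np.prod(chunk))

constant_bytes = np.zeros(n, dtype="float32").tobytes()
random_bytes = rng.random(n).astype("float32").tobytes()

# Ratio of compressed to raw size: tiny for constant data,
# much closer to 1.0 for random data
print(len(zlib.compress(constant_bytes)) / len(constant_bytes))
print(len(zlib.compress(random_bytes)) / len(random_bytes))
```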
After generating synthetic data, verify the output matches expectations before running benchmarks.
```python
import zarr
import numpy as np

z = zarr.open("/tmp/test.zarr", mode="r")
print(f"Shape: {z.shape}")
print(f"Chunks: {z.chunks}")
print(f"Dtype: {z.dtype}")
print(f"Dimensions: {z.attrs.get('dimensions')}")
print(f"Pattern: {z.attrs.get('pattern_type')}")

# Check data statistics
sample = z[0:10]
print(f"Sample mean: {np.mean(sample):.4f}")
print(f"Sample std: {np.std(sample):.4f}")
print(f"Min/Max: {np.min(sample):.4f} / {np.max(sample):.4f}")
```
# Check directory structure
ls -la /tmp/test.zarr/
# Check total size
du -sh /tmp/test.zarr/
Regenerate with the same seed and confirm identical output:
```python
import numpy as np
import zarr

z1 = zarr.open("/tmp/test_run1.zarr", mode="r")
z2 = zarr.open("/tmp/test_run2.zarr", mode="r")
assert np.array_equal(z1[:], z2[:]), "Data differs between runs"
```
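The guarantee rests on seeded random number generation; with NumPy's `Generator`, equal seeds produce bit-identical arrays:

```python
import numpy as np

a = np.random.default_rng(42).random((4, 4))
b = np.random.default_rng(42).random((4, 4))
c = np.random.default_rng(7).random((4, 4))

print(np.array_equal(a, b))  # True: same seed, identical arrays
print(np.array_equal(a, c))  # False: different seed
```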
- A mismatch between `--shape` and `--chunks` raises a `ValueError`, but this is easily overlooked when constructing commands.
- Use the `--target-size` flag with `--sample-from` to stay within resource limits.
- Avoid benchmarking with `--compression none` when your production data uses zstd. Compression changes both chunk sizes on disk and CPU cost during reads.
- Omitting `--seed` or using different seeds across runs makes results non-reproducible. Always record the seed used.
- Avoid `float64` for benchmarks when production data is `float32`. This doubles memory and I/O, distorting benchmark results.
- The script records `pattern_type` and `created_by` in Zarr attributes, but also keep a log.
- Compare `--compression zstd` and `--compression none` to isolate compression overhead.
- See the `chunking-strategy` skill for running benchmarks on generated data.