From zarr-data-format
Configure and optimize numcodecs compression for Zarr arrays: Blosc, Zstd, LZ4, Gzip, LZMA; pre-filters (Delta, Quantize); pipelines, Blosc thread safety, speed/ratio trade-offs.
npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-data-format
Slash command: /zarr-data-format:compression-codecs
Configure, select, and optimize compression codecs for Zarr arrays using **numcodecs**. This skill covers every compressor and filter in the Zarr ecosystem, thread safety for multi-process workloads, codec pipelines in Zarr v3, and performance trade-offs.
Zarr Performance Guide: https://zarr.readthedocs.io/en/latest/user-guide/performance/
numcodecs Reference: https://numcodecs.readthedocs.io/
GitHub: https://github.com/zarr-developers/numcodecs
| Codec | Speed (compress) | Speed (decompress) | Ratio | Best For |
|---|---|---|---|---|
| Blosc+LZ4 | Very Fast | Very Fast | Low-Med | Real-time analysis, frequent reads (zarr-python 2.x default) |
| Blosc+Zstd | Medium | Fast | High | General purpose |
| Zstd standalone | Medium | Fast | High | Zarr v3 default |
| Blosc+LZ4HC | Slow | Very Fast | Medium | Write-once, read-many |
| Gzip | Slow | Medium | Med-High | Interop with non-Python tools |
| LZ4 standalone | Very Fast | Very Fast | Low | Maximum throughput |
| LZMA | Very Slow | Very Slow | Very High | Archival only |
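The ratings above are qualitative, and real numbers depend on the data, so it is worth measuring on a sample chunk. Below is a minimal sketch of such a measurement. It uses only the standard-library codecs that correspond to the Gzip/Zlib, BZ2, and LZMA families, because Blosc, Zstd, and LZ4 are not in the standard library. The `measure` helper, the `CODECS` mapping, and the sample data are hypothetical names chosen for illustration.

```python
import bz2
import lzma
import struct
import time
import zlib

# Sample chunk: 50,000 little-endian int32 values forming a slow ramp (illustrative data).
raw = struct.pack("<50000i", *(i // 10 for i in range(50_000)))

# Each entry maps a label to an (encode, decode) pair.
CODECS = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "bz2-9": (lambda b: bz2.compress(b, 9), bz2.decompress),
    "lzma-6": (lambda b: lzma.compress(b, preset=6), lzma.decompress),
}

def measure(name, data):
    """Return (ratio, encode_seconds, decode_seconds) for one codec."""
    encode, decode = CODECS[name]
    t0 = time.perf_counter()
    packed = encode(data)
    t1 = time.perf_counter()
    restored = decode(packed)
    t2 = time.perf_counter()
    assert restored == data  # every codec here is lossless
    return len(data) / len(packed), t1 - t0, t2 - t1

results = {name: measure(name, raw) for name in CODECS}
for name, (ratio, enc_s, dec_s) in results.items():
    print(f"{name:8s} ratio={ratio:7.1f} encode={enc_s * 1e3:7.2f} ms decode={dec_s * 1e3:7.2f} ms")
```

The same loop should work for numcodecs codecs by swapping in their `encode` and `decode` methods. Timings vary between machines, so compare codecs on the same host and on data that resembles the real array.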
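Blosc is a meta-compressor: it wraps an internal algorithm (selected by cname) and adds a shuffle pre-filter, which is often the single most impactful setting for numerical data. Shuffle regroups the bytes of each element by byte position so that similar bytes sit together, which can improve ratios by as much as 10–40× on favorable data; the gain depends on the dtype and the values.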
| Parameter | Options | Default (numcodecs) |
|---|---|---|
| cname | blosclz, lz4, lz4hc, snappy, zlib, zstd | lz4 |
| clevel | 0 (none) – 9 (max) | 5 |
| shuffle | NOSHUFFLE (0), SHUFFLE (1), BITSHUFFLE (2) | SHUFFLE |
from numcodecs import Blosc
Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE) # balanced
Blosc(cname='lz4', clevel=1, shuffle=Blosc.SHUFFLE) # max speed
Blosc(cname='zstd', clevel=9, shuffle=Blosc.BITSHUFFLE) # max ratio
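To see why shuffling helps, here is a minimal standard-library sketch of a byte shuffle. It is an illustration of the idea, not Blosc's implementation: Blosc shuffles within internal blocks and uses optimized routines, and `zlib` stands in for the inner compressor. The helper names and the sample data are hypothetical.

```python
import random
import struct
import zlib

def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1 of every element, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))

def byte_unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle by scattering each byte plane back to its position."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n:(i + 1) * n]
    return bytes(out)

# Small integers stored as int32: the two high bytes of every element are zero.
rng = random.Random(0)
values = [rng.randrange(1000) for _ in range(10_000)]
raw = struct.pack(f"<{len(values)}i", *values)

shuffled = byte_shuffle(raw, itemsize=4)
plain_size = len(zlib.compress(raw, 6))
shuffled_size = len(zlib.compress(shuffled, 6))
print(f"unshuffled: {plain_size} bytes, shuffled: {shuffled_size} bytes")
```

After shuffling, the zero-valued high bytes form long runs that the compressor encodes very cheaply, so the shuffled buffer compresses smaller even though it holds exactly the same bytes.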
Blosc's internal threading is not fork-safe. Multi-process use (Dask workers, multiprocessing, joblib) can cause silent data corruption.
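from numcodecs import blosc
blosc.use_threads = False  # ALWAYS set this in multi-process environments
# For Dask distributed (client is an existing dask.distributed.Client):
def disable_blosc_threads():
    from numcodecs import blosc
    blosc.use_threads = False
client.run(disable_blosc_threads)  # runs on every worker connected right now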
| Codec | Import | Key Config |
|---|---|---|
| Zstd (v3 default) | from numcodecs import Zstd | Zstd(level=3) — levels 1–22 |
| LZ4 | from numcodecs import LZ4 | LZ4(acceleration=1) |
| Gzip | from numcodecs import GZip | GZip(level=5) — levels 1–9 |
| Zlib | from numcodecs import Zlib | Zlib(level=4) — levels 1–9 |
| BZ2 | from numcodecs import BZ2 | BZ2(level=5) — levels 1–9 |
| LZMA | from numcodecs import LZMA | LZMA(preset=6) — presets 0–9 |
Filters transform data before compression to improve ratios. They are applied in list order when writing and in reverse order when reading.
| Filter | Use Case | Example |
|---|---|---|
| Delta | Monotonic data (timestamps, indices) | Delta(dtype='int64') |
| Quantize | Reduce float precision | Quantize(digits=3, dtype='float64') |
| FixedScaleOffset | Convert floats to ints | FixedScaleOffset(offset=273.15, scale=100, dtype='float64', astype='int32') |
| PackBits | Boolean arrays (8× reduction) | PackBits() |
| Categorize | String→integer encoding | Categorize(labels=['a','b','c'], dtype='U10', astype='u1') |
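As a worked example of the FixedScaleOffset row, the sketch below reproduces the arithmetic in plain Python: encoding subtracts the offset, multiplies by the scale, and rounds to an integer, and decoding reverses that. It illustrates the transform only and does not call numcodecs. The temperature values are made up for illustration.

```python
offset, scale = 273.15, 100  # same parameters as the FixedScaleOffset table row

def encode(values):
    """float -> int: subtract the offset, multiply by the scale, round."""
    return [round((v - offset) * scale) for v in values]

def decode(codes):
    """int -> float: divide by the scale, add the offset back."""
    return [c / scale + offset for c in codes]

kelvin = [273.15, 293.65, 310.128]  # illustrative temperatures in kelvin
codes = encode(kelvin)
restored = decode(codes)
print(codes)
print(restored)
```

With scale=100 the stored integers keep two decimal places, so the round trip is lossy by at most 0.005 per value. The integer target type (int32 in the table row) must be wide enough for the scaled range.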
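import zarr
from numcodecs import Blosc, Delta, Quantize

# v2: filters + compressor
z = zarr.open_array('data.zarr', mode='w', shape=(10000,), dtype='int64', chunks=(1000,),
                    filters=[Delta(dtype='int64')], compressor=Blosc(cname='zstd', clevel=5))

# Chain: Quantize → Delta → compressor
# (quantize first; with Delta first, the rounding error of each difference accumulates on decode)
filters = [Quantize(digits=3, dtype='float64'), Delta(dtype='float64')]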
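To illustrate why Delta suits monotonic data, here is a minimal standard-library sketch that delta-encodes a list of timestamps and compares compressed sizes. It mirrors the idea behind the Delta filter (keep the first value, then store successive differences) but does not call numcodecs. The helper names and the synthetic timestamps are hypothetical, and `zlib` stands in for the compressor.

```python
import random
import struct
import zlib

def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack(xs):
    """Serialize integers as little-endian int64."""
    return struct.pack(f"<{len(xs)}q", *xs)

# Synthetic chunk: 1,000 timestamps in seconds, roughly one per minute with small jitter.
rng = random.Random(0)
timestamps = [1_700_000_000 + 60 * i + rng.randrange(3) for i in range(1000)]
deltas = delta_encode(timestamps)

raw_size = len(zlib.compress(pack(timestamps)))
delta_size = len(zlib.compress(pack(deltas)))
print(f"raw: {raw_size} bytes, delta-encoded: {delta_size} bytes")
```

The differences cluster around 60 and repeat often, while the raw timestamps never repeat, so the delta-encoded buffer gives the compressor far more redundancy to exploit.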
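v3 replaces compressor + filters with a unified codec pipeline of three stages: zero or more array→array codecs (filters), exactly one array→bytes codec (serializer), and zero or more bytes→bytes codecs (compressors). Decoding runs the stages in reverse.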
import zarr

# v3 with the default compressor (Zstd)
z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='float64', zarr_format=3)

# v3 with an explicit compressor (overwrite=True replaces the array created above)
z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='float64', compressors=zarr.codecs.ZstdCodec(level=5),
                      overwrite=True)

# No compression
z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='float64', compressors=None, overwrite=True)
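The stage ordering of the v3 pipeline can be sketched without Zarr at all. The example below is a conceptual stand-in: a transpose plays the array→array stage, `struct` plays the array→bytes serializer, and `zlib` plays the bytes→bytes compressor. None of these function names are Zarr APIs. It only shows that encoding runs the stages left to right and decoding runs them in reverse.

```python
import struct
import zlib

# Stage 1, array -> array: transpose a 2-D chunk (stand-in for a filter codec).
def transpose(rows):
    return [list(col) for col in zip(*rows)]

# Stage 2, array -> bytes: serialize elements as little-endian float64.
def to_bytes(rows):
    flat = [x for row in rows for x in row]
    return struct.pack(f"<{len(flat)}d", *flat)

def from_bytes(buf, nrows, ncols):
    flat = struct.unpack(f"<{nrows * ncols}d", buf)
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]

chunk = [[float(r * 10 + c) for c in range(4)] for r in range(3)]  # 3 x 4 chunk

# Encode: stages run in order; stage 3 is bytes -> bytes compression.
stored = zlib.compress(to_bytes(transpose(chunk)))

# Decode: the same stages run in reverse order (the transposed chunk is 4 x 3).
decoded = transpose(from_bytes(zlib.decompress(stored), nrows=4, ncols=3))
print(decoded == chunk)
```

Because exactly one stage turns the array into bytes, anything that needs element boundaries (such as a transpose) has to come before it, and anything that works on an opaque byte stream (such as a compressor) has to come after it.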
Primary constraint?
├── STORAGE SIZE → Zstd level 9 or LZMA (archival only)
├── READ SPEED → Blosc+LZ4 with SHUFFLE (numerical) or LZ4 standalone
├── WRITE SPEED → LZ4(acceleration=10) or Blosc+LZ4 clevel=1
├── BALANCED → Blosc+Zstd clevel=3 (v2) or Zstd level=3 (v3)
├── INTEROP → Gzip (universal) or Zlib (NetCDF compat)
└── DATA TYPE
    ├── Monotonic → Delta filter + any compressor
    ├── Boolean → PackBits + LZ4
    ├── Integer → Blosc BITSHUFFLE
    └── Limited precision float → Quantize filter + Zstd
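For use in scripts, the decision tree above can be written as a simple lookup. This is a sketch: the function name and the keys are hypothetical, and the returned strings are copied from the tree rather than being constructed codec objects.

```python
def recommend_codec(constraint: str) -> str:
    """Map a primary constraint or data type to the recommendation in the tree."""
    recommendations = {
        "storage_size": "Zstd level 9 or LZMA (archival only)",
        "read_speed": "Blosc+LZ4 with SHUFFLE (numerical) or LZ4 standalone",
        "write_speed": "LZ4(acceleration=10) or Blosc+LZ4 clevel=1",
        "balanced": "Blosc+Zstd clevel=3 (v2) or Zstd level=3 (v3)",
        "interop": "Gzip (universal) or Zlib (NetCDF compat)",
        # DATA TYPE branch
        "monotonic": "Delta filter + any compressor",
        "boolean": "PackBits + LZ4",
        "integer": "Blosc BITSHUFFLE",
        "limited_precision_float": "Quantize filter + Zstd",
    }
    try:
        return recommendations[constraint]
    except KeyError:
        raise ValueError(f"unknown constraint: {constraint!r}") from None

print(recommend_codec("read_speed"))
```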
Reminder: set blosc.use_threads = False in any multi-process environment.