From zarr-data-format
Integrates Zarr with xarray and Dask for reading, writing, and analyzing labeled multi-dimensional scientific data. Covers chunk alignment, encoding, appending, consolidated metadata, and performance optimization.
npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-data-format
This skill uses the workspace's default tool permissions.
Use xarray as the high-level interface for reading, writing, and analyzing Zarr datasets. xarray adds labeled dimensions, coordinates, and metadata to Zarr's chunked array storage, while Dask provides parallel and out-of-core computation. This skill covers the full xarray-Zarr workflow: opening stores, writing with encoding, appending data, region writes, chunk alignment, and performance optimization.
xarray Documentation: https://docs.xarray.dev/
Zarr Documentation: https://zarr.readthedocs.io/
Dask Documentation: https://docs.dask.org/
# Using pixi (recommended)
pixi add xarray zarr dask numpy netcdf4
# Using pip
pip install "xarray[complete]" zarr "dask[complete]"
# For cloud-hosted Zarr stores
pixi add s3fs gcsfs fsspec
pip install "zarr[remote]"
import xarray as xr
# ── Read a Zarr store ──
ds = xr.open_zarr("data.zarr") # local
ds = xr.open_zarr("s3://bucket/data.zarr", storage_options={"anon": True}) # S3 (public)
ds = xr.open_zarr("gs://bucket/data.zarr") # GCS
# ── Read with explicit Dask chunks ──
ds = xr.open_zarr("data.zarr", chunks={"time": 30, "lat": 90, "lon": 180})
# ── Write to Zarr ──
ds.to_zarr("output.zarr")
# ── Write with encoding ──
encoding = {
    # "chunks" must be a tuple of ints in dimension order: (time, lat, lon)
    "temperature": {"chunks": (30, 90, 180), "dtype": "float32"},
    "precipitation": {"chunks": (30, 90, 180), "dtype": "float32"},
}
ds.to_zarr("output.zarr", encoding=encoding)
# ── Append along a dimension ──
ds_new.to_zarr("output.zarr", append_dim="time")
# ── Write to a specific region ──
ds_chunk.to_zarr("output.zarr", region={"time": slice(100, 200)})
# ── Consolidated metadata (faster cloud opens) ──
ds.to_zarr("output.zarr", consolidated=True)
ds = xr.open_zarr("output.zarr", consolidated=True)
Want to read an existing Zarr store?
├── Local path → xr.open_zarr("path.zarr")
├── Cloud URL → xr.open_zarr("s3://...", storage_options={"anon": True})
└── Need specific chunks → add chunks= parameter
Want to write xarray data to Zarr?
├── New store → ds.to_zarr("out.zarr")
├── With compression → ds.to_zarr("out.zarr", encoding={...})
├── Append time steps → ds.to_zarr("out.zarr", append_dim="time")
└── Parallel region writes → ds.to_zarr("out.zarr", region={...})
Performance issues?
├── Slow open → Use consolidated=True
├── Slow compute → Align Dask chunks with Zarr chunks
└── Memory blow-up → Use compute=False or write in regions
Use this skill when:
xarray provides open_zarr() as the primary entry point for reading Zarr stores.
import xarray as xr
# ── Basic local read ──
ds = xr.open_zarr("climate_data.zarr")
print(ds)
# ── Cloud read (S3, anonymous) ──
ds = xr.open_zarr(
"s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/tos/gn/v20190726/",
storage_options={"anon": True},
consolidated=True,
)
# ── With explicit Dask chunks (lazy loading) ──
ds = xr.open_zarr(
"large_dataset.zarr",
chunks={"time": 30, "lat": 90, "lon": 180},
)
# Data is NOT loaded — ds.temperature is a Dask array
print(ds["temperature"].data) # dask.array<...>
# ── Using open_dataset with engine="zarr" (equivalent) ──
ds = xr.open_dataset("data.zarr", engine="zarr", chunks={})
Key parameters for open_zarr:
| Parameter | Default | Description |
|---|---|---|
| chunks | "auto" | Dask chunk sizes; {} = use Zarr chunks; None = skip Dask (data is converted lazily to NumPy arrays on access) |
| consolidated | None | Read consolidated metadata (faster for v2 cloud stores); None tries consolidated metadata first and falls back to regular metadata |
| storage_options | None | Passed to fsspec (e.g., {"anon": True} for public S3) |
| decode_cf | True | Decode CF conventions (times, units, masks) |
| decode_times | True | Decode time coordinates |
| group | None | Open a specific group within the store |
import xarray as xr
import numpy as np
# ── Create a sample Dataset ──
ds = xr.Dataset(
{
"temperature": (["time", "lat", "lon"], np.random.randn(365, 180, 360).astype("float32")),
"precipitation": (["time", "lat", "lon"], np.random.rand(365, 180, 360).astype("float32")),
},
coords={
"time": np.arange(365),
"lat": np.linspace(-89.5, 89.5, 180),
"lon": np.linspace(0.5, 359.5, 360),
},
attrs={"title": "Sample Climate Dataset"},
)
# ── Basic write ──
ds.to_zarr("output.zarr", mode="w")
# ── Write with encoding (recommended) ──
encoding = {
    "temperature": {
        "chunks": (30, 90, 180),  # tuple of ints in (time, lat, lon) order
        "dtype": "float32",
        # No "compressor" key: omitting it keeps the Zarr default (Zstd for v3).
        # An explicit None disables compression in Zarr v2 stores.
    },
    "precipitation": {
        "chunks": (30, 90, 180),
        "dtype": "float32",
    },
}
ds.to_zarr("output.zarr", mode="w", encoding=encoding, consolidated=True)
# ── Write to cloud ──
ds.to_zarr(
"s3://my-bucket/output.zarr",
storage_options={"key": "...", "secret": "..."},
mode="w",
)
Zarr supports two patterns for incrementally adding data: append (grow a dimension) and region (write to a specific slice).
import xarray as xr
import numpy as np
# ── Append along a dimension ──
# First write: create the store
ds_initial = xr.Dataset({
"temperature": (["time", "lat", "lon"], np.random.randn(30, 180, 360).astype("float32")),
}, coords={"time": np.arange(30), "lat": np.linspace(-89.5, 89.5, 180), "lon": np.linspace(0.5, 359.5, 360)})
ds_initial.to_zarr("timeseries.zarr", mode="w")
# Subsequent writes: append new time steps
for month in range(1, 12):
    ds_month = xr.Dataset({
        "temperature": (["time", "lat", "lon"], np.random.randn(30, 180, 360).astype("float32")),
    }, coords={"time": np.arange(month * 30, (month + 1) * 30), "lat": np.linspace(-89.5, 89.5, 180), "lon": np.linspace(0.5, 359.5, 360)})
    ds_month.to_zarr("timeseries.zarr", append_dim="time")
# ── Region writes (parallel-safe) ──
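# Pre-allocate the full store
# With Dask-backed data, compute=False writes only metadata and coordinates;
# NumPy-backed data would be written in full despite compute=False.
import dask.array as da
ds_full = xr.Dataset({
    "temperature": (["time", "lat", "lon"], da.full((365, 180, 360), np.nan, dtype="float32", chunks=(30, 180, 360))),
}, coords={"time": np.arange(365), "lat": np.linspace(-89.5, 89.5, 180), "lon": np.linspace(0.5, 359.5, 360)})
ds_full.to_zarr("parallel_output.zarr", mode="w", compute=False)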
# Each worker writes its own region
def write_region(day_start, day_end):
    data = np.random.randn(day_end - day_start, 180, 360).astype("float32")
    ds_chunk = xr.Dataset({
        "temperature": (["time", "lat", "lon"], data),
    }, coords={
        "time": np.arange(day_start, day_end),
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(0.5, 359.5, 360),
    })
    # Region writes reject variables that lack the region dimension, so drop lat/lon first
    ds_chunk.drop_vars(["lat", "lon"]).to_zarr("parallel_output.zarr", region={"time": slice(day_start, day_end)})
# Safe for concurrent writes from multiple workers as long as each region covers whole Zarr chunks
write_region(0, 30)
write_region(30, 60)
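Concurrent region writes stay independent only when no two workers touch the same Zarr chunk. A minimal sketch of how per-worker regions might be derived, assuming a time chunk size of 30; `chunk_aligned_regions` is a hypothetical helper for illustration and is not part of xarray or Zarr:

```python
def chunk_aligned_regions(dim_length, chunk_size, chunks_per_region=1):
    """Split range(dim_length) into slices that start on chunk boundaries.

    Hypothetical helper for illustration; not part of xarray or Zarr.
    """
    if dim_length < 0 or chunk_size <= 0 or chunks_per_region <= 0:
        raise ValueError("dim_length must be >= 0 and sizes must be positive")
    step = chunk_size * chunks_per_region
    # The last slice is clipped to the dimension length (a partial chunk).
    return [
        slice(start, min(start + step, dim_length))
        for start in range(0, dim_length, step)
    ]


regions = chunk_aligned_regions(365, 30)
print(len(regions))  # 13 regions: 12 full chunks plus one partial chunk
print(regions[-1])   # slice(360, 365, None)
```

Each slice's `start` and `stop` could then be handed to a separate worker as the `day_start` and `day_end` arguments of `write_region`.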
Combine multiple files into a single virtual Zarr store without copying data.
import xarray as xr
# ── open_mfdataset with Zarr files ──
ds = xr.open_mfdataset(
["year_2020.zarr", "year_2021.zarr", "year_2022.zarr"],
engine="zarr",
concat_dim="time",
combine="nested",
chunks={"time": 365},
)
# ── VirtualiZarr for reference-based access ──
# Creates virtual references to existing files (no data copy)
from virtualizarr import open_virtual_dataset
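vds_list = []
for path in ["data_2020.nc", "data_2021.nc", "data_2022.nc"]:
    vds = open_virtual_dataset(path)
    vds_list.append(vds)
# coords="minimal" and compat="override" avoid loading and comparing coordinate values
combined = xr.concat(vds_list, dim="time", coords="minimal", compat="override")
# Note: open_virtual_dataset arguments and the writer methods on .virtualize
# differ between VirtualiZarr releases; check the installed version.
combined.virtualize.to_zarr("combined_refs.zarr") # write virtual store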
Dask chunks should align with Zarr chunks for best performance. When they are misaligned, several Dask tasks read and decompress the same Zarr chunk, which wastes I/O and memory.
import xarray as xr
# ── Check current Zarr chunk sizes ──
ds = xr.open_zarr("data.zarr", chunks={}) # use Zarr's native chunks
for var in ds.data_vars:
    encoding = ds[var].encoding
    print(f"{var}: Zarr chunks = {encoding.get('chunks')}")
    print(f"{var}: Dask chunks = {ds[var].data.chunksize}")
# ── Align Dask chunks = Zarr chunks (best practice) ──
ds = xr.open_zarr("data.zarr", chunks={}) # empty dict = match Zarr chunks
# ── Use multiples of Zarr chunks ──
# If Zarr chunks are (30, 90, 180), these are aligned:
ds = xr.open_zarr("data.zarr", chunks={"time": 60, "lat": 90, "lon": 360})
# 60 = 2 * 30 ✓, 90 = 1 * 90 ✓, 360 = 2 * 180 ✓
# ── Misaligned chunks (avoid!) ──
# ds = xr.open_zarr("data.zarr", chunks={"time": 45}) # 45 is not a multiple of 30
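The multiple-of check in the comments above can be written as a small helper. This is a sketch only: `is_aligned` is a hypothetical name, it takes plain dicts of chunk sizes rather than xarray objects, and it treats dimensions missing from the Dask chunk spec as unconstrained.

```python
def is_aligned(dask_chunks, zarr_chunks):
    """Return True if every given Dask chunk size is a whole multiple
    of the Zarr chunk size along the same dimension.

    Both arguments map dimension name -> chunk size. Dimensions absent
    from dask_chunks are skipped (an assumption of this sketch).
    """
    return all(
        dask_chunks[dim] % zarr_size == 0
        for dim, zarr_size in zarr_chunks.items()
        if dim in dask_chunks
    )


zarr_chunks = {"time": 30, "lat": 90, "lon": 180}
print(is_aligned({"time": 60, "lat": 90, "lon": 360}, zarr_chunks))  # True
print(is_aligned({"time": 45}, zarr_chunks))                         # False
```

The Zarr side of the comparison can be read from each variable's `encoding["chunks"]`, as in the inspection loop earlier in this section.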
Alignment rules:
- Use chunks={} (empty dict) to automatically match Zarr chunks
- Otherwise, make each Dask chunk an integer multiple of the Zarr chunk: dask_chunk = N * zarr_chunk

The encoding dict controls how each variable is stored in Zarr.
encoding = {
    "temperature": {
        "chunks": (30, 90, 180), # Zarr chunk sizes, tuple in (time, lat, lon) order
        "dtype": "float32", # on-disk dtype
        "_FillValue": -9999.0, # fill value
        # "compressor" omitted: the default (Zstd) applies; an explicit None disables compression
    },
    "time": {
        "chunks": (365,),
        "dtype": "int64",
    },
}
ds.to_zarr("encoded.zarr", encoding=encoding)
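When choosing the chunks entry, it helps to know how large one uncompressed chunk is. A rough sketch, assuming a hand-written table of item sizes for a few common dtypes; `chunk_nbytes` and `ITEMSIZE` are hypothetical names used only for this illustration:

```python
import math

# Bytes per element for a few common dtypes (assumed subset for this sketch).
ITEMSIZE = {"int16": 2, "int32": 4, "int64": 8, "float32": 4, "float64": 8}


def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed size in bytes of one chunk with the given shape and dtype."""
    return math.prod(chunk_shape) * ITEMSIZE[dtype]


size = chunk_nbytes((30, 90, 180), "float32")
print(size)                    # 1944000
print(round(size / 2**20, 2))  # 1.85 (MiB)
```

The compressed size on disk depends on the codec and the data, so this figure is an upper bound for planning rather than a measurement.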
Common encoding fields:
| Field | Purpose |
|---|---|
| chunks | Zarr chunk sizes (tuple of ints, one per dimension) |
| dtype | On-disk data type |
| compressor | Compression codec (numcodecs object; None disables compression) |
| _FillValue | Fill value for missing data |
| scale_factor / add_offset | CF packing parameters |
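CF packing stores floats as small integers and restores them on read as unpacked = packed * scale_factor + add_offset. The sketch below shows one commonly used recipe for deriving the two parameters from a known value range for a signed 16-bit target. The helper names are hypothetical, and the code works on plain Python floats rather than going through xarray:

```python
def packing_params(vmin, vmax, nbits=16):
    """Derive scale_factor and add_offset that map [vmin, vmax]
    onto the full signed integer range of width nbits."""
    scale_factor = (vmax - vmin) / (2**nbits - 1)
    add_offset = vmin + 2 ** (nbits - 1) * scale_factor
    return scale_factor, add_offset


def pack(value, scale_factor, add_offset):
    """Float -> packed integer (what would be stored on disk)."""
    return round((value - add_offset) / scale_factor)


def unpack(packed, scale_factor, add_offset):
    """Packed integer -> float (CF decoding rule)."""
    return packed * scale_factor + add_offset


scale, offset = packing_params(-50.0, 50.0)
packed = pack(21.37, scale, offset)
restored = unpack(packed, scale, offset)
print(packed, restored)
```

In an encoding dict, the resulting values would sit next to an integer dtype, for example "dtype": "int16" with "scale_factor" and "add_offset". The round-trip error is bounded by about half of scale_factor.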
import xarray as xr
# ── Use consolidated metadata for fast cloud opens ──
ds.to_zarr("s3://bucket/data.zarr", consolidated=True)
ds = xr.open_zarr("s3://bucket/data.zarr", consolidated=True)
# ── Avoid loading data unnecessarily ──
# compute=False defers writing Dask-backed arrays, so only metadata and coordinates are written up front (for pre-allocation)
ds.to_zarr("preallocated.zarr", compute=False)
# ── Use Dask for parallel writes ──
ds_lazy = xr.open_zarr("input.zarr", chunks={"time": 30})
result = ds_lazy["temperature"].mean(dim="time")
result.to_dataset(name="temp_mean").to_zarr("mean_output.zarr")
# ── Rechunk before writing if needed ──
ds_rechunked = ds_lazy.chunk({"time": 365, "lat": 45, "lon": 45})
ds_rechunked.to_zarr("rechunked.zarr")
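Rechunking also changes how many chunk objects the store holds, which matters on object storage where every chunk is a separate request. A small sketch of the count, assuming one stored object per chunk (no sharding) and the 365 x 180 x 360 array used in the earlier examples; `n_chunks` is a hypothetical helper:

```python
import math


def n_chunks(shape, chunks):
    """Number of chunks needed to cover an array of `shape`
    when it is split into blocks of size `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


shape = (365, 180, 360)
print(n_chunks(shape, (30, 90, 180)))  # 52
print(n_chunks(shape, (365, 45, 45)))  # 32
```

Fewer, larger chunks mean fewer requests but more data read per request, so the right balance depends on the access pattern.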