Works with labeled multidimensional arrays for scientific data using Xarray: NetCDF/HDF5/Zarr I/O, Dask integration, DataTree, rioxarray geospatial operations.
npx claudepluginhub uw-ssec/rse-plugins --plugin scientific-domain-applications
This skill uses the workspace's default tool permissions.
Master **Xarray**, the powerful library for working with labeled multidimensional arrays in scientific Python. Learn how to efficiently handle complex datasets with multiple dimensions, coordinates, and metadata - from climate data and satellite imagery to experimental measurements and simulations.
Official Documentation: https://docs.xarray.dev/
GitHub: https://github.com/pydata/xarray
# Using pixi (recommended for scientific projects)
pixi add xarray netcdf4 dask
# Using pip (quote the extras so shells such as zsh do not expand the brackets)
pip install "xarray[complete]"
# Optional dependencies for specific formats
pixi add zarr h5netcdf scipy bottleneck
# Geospatial extensions (for raster data, CRS handling, reprojection)
pixi add rioxarray xesmf
# DataTree is built into Xarray since version 2024.10 (no separate installation needed)
import numpy as np
import pandas as pd
import xarray as xr
# DataArray: Single labeled array
temperature = xr.DataArray(
    data=np.random.randn(3, 4),
    dims=["time", "location"],
    coords={
        # Datetime labels (not plain strings) so that "time.month" works below
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["A", "B", "C", "D"],
    },
    name="temperature",
)
# Dataset: Collection of DataArrays
ds = xr.Dataset({
    "temperature": temperature,
    "pressure": (["time", "location"], np.random.randn(3, 4)),
})
# Selection by label
ds.sel(time="2024-01-01")
ds.sel(location="A")
# Selection by index
ds.isel(time=0)
# Slicing
ds.sel(time=slice("2024-01-01", "2024-01-02"))
# Aggregation
ds.mean(dim="time")
ds.sum(dim="location")
# Computation
ds["temperature"] + 273.15 # Celsius to Kelvin
ds.groupby("time.month").mean()
# I/O operations
ds.to_netcdf("data.nc")
ds = xr.open_dataset("data.nc")
Working with multidimensional scientific data?
├─ YES → Use Xarray for labeled dimensions
└─ NO → NumPy/Pandas sufficient
Need to track coordinates and metadata?
├─ YES → Xarray keeps everything aligned
└─ NO → Plain NumPy arrays work
Working with geospatial raster data?
├─ YES → Use rioxarray for CRS-aware operations
└─ NO → Standard Xarray sufficient
Data has natural hierarchical structure?
├─ YES → Use DataTree for organization
└─ NO → Dataset/DataArray sufficient
Data too large for memory?
├─ YES → Use Xarray with Dask backend
└─ NO → Standard Xarray is fine
Need to save/load scientific data formats?
├─ NetCDF/HDF5 → Xarray native support
├─ Zarr → Use Xarray with zarr backend
└─ CSV/Excel → Pandas then convert to Xarray
Working with time series data?
├─ Multi-dimensional → Xarray
└─ Tabular → Pandas
Need to align data from different sources?
├─ YES → Xarray handles alignment automatically
└─ NO → Manual alignment with NumPy
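The "handles alignment automatically" branch above can be illustrated with a small sketch. The two arrays, their names, and the `day` coordinate are hypothetical and exist only for this example; the point is that arithmetic matches values by label rather than by position.

```python
import numpy as np
import xarray as xr

# Two hypothetical sensors reporting on overlapping but different days.
a = xr.DataArray([1.0, 2.0, 3.0], dims="day", coords={"day": [1, 2, 3]}, name="a")
b = xr.DataArray([10.0, 20.0, 30.0], dims="day", coords={"day": [2, 3, 4]}, name="b")

# Arithmetic aligns on the "day" labels, not on array position.
# By default only labels present in both arrays are kept (inner join).
total = a + b
print(total["day"].values)  # days 2 and 3 only

# xr.align with join="outer" keeps every label and fills the gaps with NaN.
a_outer, b_outer = xr.align(a, b, join="outer")
print(a_outer.values)
print(b_outer.values)
```

With plain NumPy the same addition would silently pair day 1 with day 2, which is the kind of mistake label-based alignment is meant to prevent.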
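Use Xarray when working with multidimensional data that carries coordinates and metadata, such as climate data, satellite imagery, experimental measurements, and simulations.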
A DataArray is Xarray's fundamental data structure - think of it as a NumPy array with labels and metadata.
Anatomy of a DataArray:
import numpy as np
import pandas as pd  # needed for pd.date_range below
import xarray as xr
# Create a DataArray
temperature = xr.DataArray(
    data=np.array([[15.2, 16.1, 14.8],
                   [16.5, 17.2, 15.9],
                   [17.1, 18.0, 16.5]]),
    dims=["time", "location"],
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["Station_A", "Station_B", "Station_C"],
        "lat": ("location", [40.7, 34.0, 41.8]),
        "lon": ("location", [-74.0, -118.2, -87.6]),
    },
    attrs={
        "units": "Celsius",
        "description": "Daily average temperature",
    },
)
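Key components: data (the array values), dims (the dimension names), coords (labels along each dimension, plus auxiliary coordinates such as lat and lon), attrs (free-form metadata), and an optional name.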
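The pieces of a DataArray can be inspected directly as attributes. The following is a minimal sketch using a smaller, hypothetical `temps` array (not the one defined above) so that it runs on its own.

```python
import numpy as np
import pandas as pd
import xarray as xr

temps = xr.DataArray(
    data=np.array([[15.2, 16.1], [16.5, 17.2], [17.1, 18.0]]),
    dims=["time", "location"],
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["Station_A", "Station_B"],
    },
    attrs={"units": "Celsius"},
    name="temperature",
)

print(temps.dims)            # names of the dimensions, in order
print(temps.shape)           # length along each dimension
print(list(temps.coords))    # names of the attached coordinates
print(temps.attrs["units"])  # metadata stored alongside the data
print(type(temps.values))    # the underlying NumPy array
```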
A Dataset is like a dict of DataArrays that share dimensions - similar to a Pandas DataFrame but for N-dimensional data.
Example:
# Create a Dataset (assumes numpy, pandas and xarray are imported as np, pd and xr)
ds = xr.Dataset(
    {
        "temperature": (["time", "location"], np.random.randn(3, 4)),
        "humidity": (["time", "location"], np.random.rand(3, 4) * 100),
        "pressure": (["time", "location"], 1013 + np.random.randn(3, 4) * 10),
    },
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["A", "B", "C", "D"],
    },
)
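To make the "dict of DataArrays that share dimensions" idea concrete, here is a hedged sketch of typical Dataset access patterns. The `weather` dataset and the derived `temperature_k` variable are illustrative names, not part of the original example.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
weather = xr.Dataset(
    {
        "temperature": (["time", "location"], rng.normal(15, 2, size=(3, 4))),
        "humidity": (["time", "location"], rng.uniform(0, 100, size=(3, 4))),
    },
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["A", "B", "C", "D"],
    },
)

# Dict-style access returns a DataArray that carries the shared coordinates.
temp = weather["temperature"]

# A single selection applies to every variable at once.
first_day = weather.isel(time=0)

# Derived variables can be stored alongside the existing ones.
weather["temperature_k"] = weather["temperature"] + 273.15

print(list(weather.data_vars))
print(dict(weather.sizes))
```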
Coordinates provide meaningful labels for array dimensions and enable label-based indexing.
Types of coordinates:
Dimension coordinates (1D, same name as dimension):
time_coord = pd.date_range("2024-01-01", periods=365)
Non-dimension coordinates (auxiliary information):
# Latitude/longitude for each station
coords = {
"time": time_coord,
"station": ["A", "B", "C"],
"lat": ("station", [40.7, 34.0, 41.8]),
"lon": ("station", [-74.0, -118.2, -87.6])
}
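The practical difference between the two kinds of coordinates shows up once they are attached to data. This sketch reuses the coordinate values above on a hypothetical all-zeros array named `obs`: dimension coordinates are indexed and work with `.sel`, while non-dimension coordinates such as `lat` travel with the data and can drive filtering.

```python
import numpy as np
import pandas as pd
import xarray as xr

time_coord = pd.date_range("2024-01-01", periods=365)
obs = xr.DataArray(
    np.zeros((365, 3)),
    dims=["time", "station"],
    coords={
        "time": time_coord,
        "station": ["A", "B", "C"],
        "lat": ("station", [40.7, 34.0, 41.8]),
        "lon": ("station", [-74.0, -118.2, -87.6]),
    },
)

# Dimension coordinates are backed by an index, so .sel works on them directly.
station_b = obs.sel(station="B")

# Non-dimension coordinates stay attached to the selected data.
print(float(station_b["lat"]))

# They can also drive boolean filtering along their dimension.
northern = obs.where(obs["lat"] > 40, drop=True)
print(northern["station"].values)
```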
Xarray provides powerful label-based and position-based indexing.
Label-based selection (.sel):
# Select by coordinate value
ds.sel(time="2024-01-15")
ds.sel(location="Station_A")
# Nearest neighbor selection
ds.sel(time="2024-01-15", method="nearest")
# Range selection
ds.sel(time=slice("2024-01-01", "2024-01-31"))
Position-based selection (.isel):
# Select by integer position
ds.isel(time=0)
ds.isel(location=[0, 2])
Boolean indexing (.where):
# Keep only values meeting condition
ds.where(ds["temperature"] > 15, drop=True)
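The selection methods above can be tried end to end on a small dataset with known values. This is a sketch under the assumption of a hypothetical 4-day, 3-station dataset whose temperatures run from 10 to 21, so the result of each call is easy to check by eye.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"temperature": (["time", "location"], np.arange(12.0).reshape(4, 3) + 10)},
    coords={
        "time": pd.date_range("2024-01-01", periods=4),
        "location": ["Station_A", "Station_B", "Station_C"],
    },
)

# Label-based: exact match on both coordinates.
by_label = ds["temperature"].sel(time="2024-01-02", location="Station_A")

# Nearest neighbor: 10:00 on Jan 2 is closer to Jan 2 than to Jan 3.
nearest = ds.sel(time="2024-01-02T10:00", method="nearest")

# Label slices include both endpoints.
two_days = ds.sel(time=slice("2024-01-02", "2024-01-03"))

# Position-based: first time step, first and third locations.
by_position = ds["temperature"].isel(time=0, location=[0, 2])

# Boolean: mask values <= 15, then drop labels that are entirely masked.
warm = ds.where(ds["temperature"] > 15, drop=True)

print(by_label.item(), by_position.values, dict(warm.sizes))
```

Note that `.where` without `drop=True` keeps the original shape and fills the masked positions with NaN.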
DataTree is Xarray's class for organizing hierarchical (tree-structured) data. Think of it as a filesystem for datasets, where each node can contain a dataset and child nodes.
When to use DataTree:
Creating a DataTree:
import xarray as xr
# From a dictionary of datasets
dt = xr.DataTree.from_dict({
"/": xr.Dataset({"description": "Root metadata"}),
"/observations": xr.Dataset({"temp": (["time"], [15.2, 16.1, 14.8])}),
"/observations/station_a": xr.Dataset({"location": "New York"}),
"/observations/station_b": xr.Dataset({"location": "Los Angeles"}),
"/model_outputs": xr.Dataset({"predicted_temp": (["time"], [15.0, 16.0, 15.0])})
})
# Access nodes using filesystem-like paths
print(dt["/observations/station_a"])
print(dt["observations"]["station_a"]) # Equivalent
Key DataTree operations:
# Navigate the tree
dt.parent # Get parent node
dt.children # Get child nodes dict
dt.subtree # Iterate over all descendant nodes
dt.leaves # Get all leaf nodes
# Apply operations across all datasets
dt.mean(dim="time") # Apply to all nodes
# Map custom functions (the function must be valid for every variable in every node)
dt.map_over_datasets(lambda ds: ds + 273.15)
# Filter nodes
dt.match("*/station_*") # Pattern matching
dt.filter(lambda node: "temp" in node.ds.data_vars) # Content-based filtering
# Coordinate inheritance (child nodes inherit parent coordinates)
# Define coordinates once at parent level, accessible in all children
Combining DataTrees:
# Arithmetic operations on isomorphic trees
dt1 + dt2 # Add corresponding datasets at each node
# Check structure compatibility
dt1.isomorphic(dt2) # Returns True if same structure
Xarray has a rich ecosystem of extensions for domain-specific workflows. For geospatial data analysis, prioritize rioxarray over vanilla Xarray.
Key geospatial extensions:
rioxarray - Geospatial raster operations:
import rioxarray
# Open raster with CRS (Coordinate Reference System) awareness
ds = rioxarray.open_rasterio("satellite_image.tif")
# Reproject to different CRS
ds_reprojected = ds.rio.reproject("EPSG:4326")
# Clip to bounding box
ds_clipped = ds.rio.clip_box(minx=-120, miny=35, maxx=-115, maxy=40)
# Write with CRS metadata
ds.rio.to_raster("output.tif")
Other useful extensions:
When to use which:
See references/patterns.md for detailed patterns including:
See references/examples.md for complete examples including:
See references/common-issues.md for solutions to:
Xarray is the go-to library for working with labeled multidimensional arrays in scientific Python. It combines the power of NumPy arrays with the convenience of Pandas labels, making it ideal for climate data, satellite imagery, experimental measurements, and any data with multiple dimensions.
Key takeaways:
Next steps:
Xarray transforms complex multidimensional data analysis into intuitive, readable code while maintaining high performance and scalability.