From sciagent-skills
Query and download NCI Imaging Data Commons cancer radiology/pathology datasets via idc-index Python client. Uses local DuckDB SQL for cohort selection and streams DICOM from public S3/GCS. No auth required.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sciagent-skills:imaging-data-commonsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
NCI Imaging Data Commons (IDC) is the largest public collection of cancer imaging data, hosting 175+ DICOM collections (CT, MR, PET, slide microscopy, segmentations, structured reports). The idc-index Python client ships the entire IDC metadata catalog as a parquet file bundled inside the pip wheel; IDCClient() loads it into DuckDB, so sql_query() runs locally with zero network calls.
NCI Imaging Data Commons (IDC) is the largest public collection of cancer imaging data, hosting 175+ DICOM collections (CT, MR, PET, slide microscopy, segmentations, structured reports). The idc-index Python client ships the entire IDC metadata catalog as a parquet file bundled inside the pip wheel; IDCClient() loads it into DuckDB, so sql_query() runs locally with zero network calls.
Image downloads stream from public AWS S3 (default) or Google Cloud Storage buckets via the bundled s5cmd executable. No GCP/AWS credentials, no BigQuery billing, no service account JSON.
idc-index (>=0.12), pandas, pydicom (for reading DICOM files after download)series_size_MBs5cmd executable used for high-speed S3 transfers# Skip when already provisioned in a pixi or conda env
pip install idc-index pydicom
from idc_index import IDCClient
client = IDCClient()
print('IDC version:', client.get_idc_version())
print('Collections:', len(client.get_collections()))
print('First 5:', client.get_collections()[:5])
df = client.sql_query("""
SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) AS n_series
FROM index
WHERE Modality = 'CT'
GROUP BY collection_id
ORDER BY n_series DESC LIMIT 5
""")
print(df)
IDCClient() is the single entry point. It loads the parquet index, registers it as the DuckDB table named index, and validates the bundled s5cmd. get_collections() returns a plain Python list of lowercase collection IDs such as nsclc_radiomics, lidc_idri, and tcga_gbm.
from idc_index import IDCClient
client = IDCClient()
collections = client.get_collections()
print("total:", len(collections), "type:", type(collections).__name__)
print("lung-related:", [c for c in collections if "lung" in c or "nsclc" in c][:5])
The recommended cohort-selection API. The table name is index; additional tables (prior_versions_index and optional sm_index, clinical_index) are auto-registered when installed. Returns a pandas DataFrame. Use this for any filter involving Modality, BodyPartExamined, series_size_MB, or arbitrary DICOM tags. The legacy get_series(collection_id=..., modality=...) signature no longer exists in idc-index.
# Cohort: small CT series in NSCLC Radiomics, sorted by size for cheap testing
df = client.sql_query("""
SELECT SeriesInstanceUID, StudyInstanceUID, PatientID, Modality, series_size_MB
FROM index
WHERE collection_id = 'nsclc_radiomics'
AND Modality = 'CT'
ORDER BY series_size_MB ASC
LIMIT 5
""")
print(df[["PatientID", "Modality", "series_size_MB"]])
# Cross-collection lung CT count, using DuckDB ILIKE for case-insensitive matching
df = client.sql_query("""
SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) AS n
FROM index
WHERE Modality = 'CT' AND BodyPartExamined ILIKE '%LUNG%'
GROUP BY collection_id
ORDER BY n DESC LIMIT 5
""")
print(df)
For DICOM-hierarchy navigation (Collection -> Patient -> Study -> Series), use the typed helpers. Each accepts an outputFormat of dict (default), df, or list. Note: get_dicom_series() takes a studyInstanceUID (not collection_id). Use sql_query() if you want to filter series by collection or modality.
patients = client.get_patients('nsclc_radiomics', outputFormat='df')
print("patients:", patients.shape, list(patients.columns)[:5])
# Walk down the hierarchy from one patient
pid = patients["PatientID"].iloc[0]
studies = client.get_dicom_studies(pid, outputFormat='df')
print("studies for", pid, ":", studies.shape)
study_uid = studies["StudyInstanceUID"].iloc[0]
series = client.get_dicom_series(study_uid, outputFormat='df')
print("series in study:", series.shape, "modalities:", series["Modality"].unique().tolist())
The preferred download method. Accepts any combination of collection_id, patientId, studyInstanceUID, seriesInstanceUID, sopInstanceUID, or crdc_series_uuid (each a string or list). Filters apply in sequence. Use dry_run=True to size the cohort before pulling bytes. Files land under dirTemplate (default: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID).
import tempfile, glob, os
# Pick the smallest CT series for a fast smoke test (about 40 MB)
df = client.sql_query("""
SELECT SeriesInstanceUID FROM index
WHERE collection_id = 'nsclc_radiomics' AND Modality = 'CT'
ORDER BY series_size_MB ASC LIMIT 1
""")
uid = df["SeriesInstanceUID"].iloc[0]
out = tempfile.mkdtemp(prefix='idc_dl_')
client.download_from_selection(
downloadDir=out,
seriesInstanceUID=[uid],
quiet=True,
show_progress_bar=False,
source_bucket_location='aws', # 'aws' (default) or 'gcs'
)
files = glob.glob(os.path.join(out, "**/*.dcm"), recursive=True)
print(f"downloaded {len(files)} DICOM files to {out}")
# Size a cohort before downloading
client.download_from_selection(
downloadDir='./preview',
collection_id='nsclc_radiomics',
dry_run=True,
)
Single-series download. Internally calls download_from_selection(seriesInstanceUID=...). Accepts a string or list of UIDs.
import tempfile, glob, os
out = tempfile.mkdtemp(prefix='idc_dl_')
client.download_dicom_series(
seriesInstanceUID=uid, # string or list
downloadDir=out,
quiet=True,
show_progress_bar=False,
)
print("files:", len(glob.glob(os.path.join(out, "**/*.dcm"), recursive=True)))
Returns a shareable URL to the IDC OHIF (radiology) or Slim (slide microscopy) viewer. Auto-selects the viewer based on modality if viewer_selector is omitted.
url = client.get_viewer_URL(seriesInstanceUID=uid)
print("Open in browser:", url)
# Force a specific viewer
url_v2 = client.get_viewer_URL(seriesInstanceUID=uid, viewer_selector='ohif_v2')
print(url_v2)
After download, files are organized under dirTemplate. Use pydicom for header and pixel inspection.
import pydicom, glob, os
dcm_files = sorted(glob.glob(os.path.join(out, "**/*.dcm"), recursive=True))
ds = pydicom.dcmread(dcm_files[0])
print("PatientID:", ds.PatientID)
print("Modality :", ds.Modality)
print("Rows x Cols:", ds.Rows, "x", ds.Columns)
print("Pixel array:", ds.pixel_array.shape)
The IDC client is unusual: client = IDCClient() is fully offline for queries. The parquet index (hundreds of MB, ships in idc-index-data) is read into memory and registered as the DuckDB table named index. Every sql_query(), get_collections(), get_patients(), etc. is a local pandas/DuckDB operation. Network traffic only happens during download_from_selection() / download_dicom_series(), which shell out to the bundled s5cmd executable to copy from public buckets (s3://idc-open-data/... by default; switch to gcs via source_bucket_location=gcs). Consequently: no gcloud auth, no GCP project ID, no BigQuery billing, no API keys.
IDC follows the standard DICOM model: Collection (collection_id, lowercase with underscores like nsclc_radiomics) -> Patient (PatientID) -> Study (StudyInstanceUID) -> Series (SeriesInstanceUID) -> Instance (SOPInstanceUID). Downloads operate at the series level by default. The crdc_series_uuid is an alternative immutable identifier preferred for long-term references.
The index is pinned to the installed idc-index-data package version (client.get_idc_version(), e.g., v24). For reproducibility, pin both idc-index and idc-index-data in your environment. Older releases remain accessible via the prior_versions_index table inside sql_query().
Goal: Build a CSV manifest of small CT series for ML prototyping, then download them.
from idc_index import IDCClient
import pandas as pd, tempfile, os, glob
client = IDCClient()
# Step 1: SQL cohort: small CTs across three lung collections
cohort = client.sql_query("""
SELECT SeriesInstanceUID, PatientID, collection_id, Modality, series_size_MB
FROM index
WHERE collection_id IN ('nsclc_radiomics', 'tcga_luad', 'tcga_lusc')
AND Modality = 'CT'
AND series_size_MB < 60
ORDER BY series_size_MB ASC
""")
print(f"Cohort: {len(cohort)} series, total {cohort['series_size_MB'].sum():.1f} MB")
# Step 2: Save manifest
cohort.to_csv('ct_lung_manifest.csv', index=False)
# Step 3: Download a small subset
out = tempfile.mkdtemp(prefix='idc_cohort_')
client.download_from_selection(
downloadDir=out,
seriesInstanceUID=cohort["SeriesInstanceUID"].head(2).tolist(),
quiet=True,
show_progress_bar=False,
)
print("downloaded", len(glob.glob(os.path.join(out, "**/*.dcm"), recursive=True)), "files")
Goal: Drill from collection -> patient -> studies -> series -> DICOM headers without writing SQL.
from idc_index import IDCClient
import tempfile, glob, os, pydicom
client = IDCClient()
# Pick first patient in the collection
patients = client.get_patients('nsclc_radiomics', outputFormat='df')
pid = patients["PatientID"].iloc[0]
studies = client.get_dicom_studies(pid, outputFormat='df')
series_df = client.get_dicom_series(studies["StudyInstanceUID"].iloc[0], outputFormat='df')
print(f"Patient {pid}: {len(studies)} studies, {len(series_df)} series in first study")
print("modalities:", series_df["Modality"].unique().tolist())
# Pick the smallest image series (CT/MR, not SR/SEG) and download
imaging = series_df[series_df["Modality"].isin(["CT", "MR", "PT"])]
target_uid = imaging.sort_values("series_size_MB").iloc[0]["SeriesInstanceUID"]
out = tempfile.mkdtemp(prefix='idc_one_')
client.download_dicom_series(seriesInstanceUID=target_uid, downloadDir=out,
quiet=True, show_progress_bar=False)
ds = pydicom.dcmread(sorted(glob.glob(os.path.join(out, "**/*.dcm"), recursive=True))[0])
print("first slice:", ds.Modality, ds.pixel_array.shape)
print("viewer:", client.get_viewer_URL(seriesInstanceUID=target_uid))
| Parameter | Module / Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| outputFormat | get_patients / get_dicom_studies / get_dicom_series | dict | dict, df, list | Return type of accessor methods |
| seriesInstanceUID | download_from_selection / download_dicom_series | None | str or list[str] | Filter by DICOM SeriesInstanceUID |
| collection_id | download_from_selection / get_patients | None / required | lowercase string(s) like nsclc_radiomics | Filter by collection (lowercase, underscored) |
| source_bucket_location | download_from_selection / download_dicom_series | aws | aws, gcs | Public bucket to pull from |
| dirTemplate | download_from_selection | %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID | template string with %-prefixed DICOM tags, or None for flat layout | On-disk folder hierarchy |
| dry_run | download_from_selection / download_dicom_series | False | True/False | Compute cohort size without downloading |
| use_s5cmd_sync | download_from_selection / download_dicom_series | False | True/False | Use s5cmd sync (resume partial) instead of cp |
| viewer_selector | get_viewer_URL | None (auto) | ohif_v2, ohif_v3, slim | Force a specific browser viewer |
Use sql_query() for any filter beyond a single hierarchy step: It returns DataFrames, scales to the full ~10M-row index instantly via DuckDB, and supports arbitrary DICOM tags. Reach for get_patients / get_dicom_studies / get_dicom_series only when you already have the parent UID.
Always dry-run large cohorts: Calling download_from_selection(..., dry_run=True) prints the total size before any bytes move. A small collection like lidc_idri is ~120 GB, while nlst exceeds 10 TB.
Pin both idc-index and idc-index-data: The index version drives reproducibility. Record client.get_idc_version() (e.g., v24) in your dataset card or paper methods.
Prefer AWS for most downloads: source_bucket_location=aws is the default and is fastest from most academic networks. Switch to gcs only when running on GCP compute that is co-located with the bucket.
Do not try to use the legacy BigQuery flow as a default: bigquery-public-data.idc_current.dicom_all still works, but requires GCP auth and billing and is no longer the recommended path. See Recipes for one optional snippet.
When to use: Build a multimodal study where each patient must have both modalities.
df = client.sql_query("""
SELECT PatientID
FROM index
WHERE collection_id = 'tcga_gbm'
GROUP BY PatientID
HAVING SUM(CASE WHEN Modality = 'CT' THEN 1 ELSE 0 END) > 0
AND SUM(CASE WHEN Modality = 'MR' THEN 1 ELSE 0 END) > 0
""")
print(f"Patients with both CT and MR in TCGA-GBM: {len(df)}")
When to use: Send collaborators a shortlist of series to inspect visually.
sample = client.sql_query("""
SELECT SeriesInstanceUID, Modality
FROM index WHERE collection_id = 'lidc_idri' AND Modality = 'CT'
LIMIT 5
""")
for _, row in sample.iterrows():
print(row.Modality, client.get_viewer_URL(seriesInstanceUID=row.SeriesInstanceUID))
When to use: You need to join IDC metadata against another BigQuery public dataset (e.g., TCGA clinical) and have a GCP project with billing already configured. For all other use cases, prefer sql_query().
# Requires: a google-cloud-bigquery install and gcloud application-default credentials.
from google.cloud import bigquery
bq = bigquery.Client(project='your-gcp-project-id')
df = bq.query("""
SELECT collection_id, Modality, COUNT(DISTINCT SeriesInstanceUID) AS n
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality IN ('CT', 'MR', 'PET')
GROUP BY collection_id, Modality
ORDER BY n DESC LIMIT 20
""").to_dataframe()
print(df)
| Problem | Cause | Solution |
|---|---|---|
ImportError: No module named idc_index | Package not installed | Run pip install idc-index (no auth setup needed) |
| IDCClient() slow on first call | Loading parquet index into memory (1-3 s, one-time per process) | Reuse the client across queries; do not re-instantiate per call |
| sql_query: Catalog Error: Table idc_index does not exist | Wrong table name | The DuckDB table is named index, not idc_index or dicom_all |
| get_dicom_series(collection_id=..., modality=...) raises TypeError | Outdated API; that signature was removed | Use sql_query with WHERE collection_id and Modality filters instead |
| download_from_selection errors with s5cmd: command not found | Bundled s5cmd not on PATH | Reinstall idc-index; the wheel bundles s5cmd. On NixOS/non-glibc systems, set IDC_INDEX_S5CMD_EXE to a system s5cmd |
| Downloads stall or are very slow | Network bottleneck, not auth | Try source_bucket_location=gcs if on GCP; otherwise use use_s5cmd_sync=True to resume |
| ValueError: collection_id ... does not exist | Collection ID is case-sensitive and underscored | Use lowercase with underscores: lidc_idri not LIDC-IDRI; check client.get_collections() |
| pydicom cannot read pixel data | Compressed transfer syntax | Install pylibjpeg, pylibjpeg-libjpeg, gdcm, then retry ds.pixel_array |
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsQueries and downloads public cancer imaging data from the NCI Imaging Data Commons via the idc-index Python package. Filters by metadata (cancer type, modality, anatomy), visualizes DICOM in-browser, and checks licenses—no authentication needed. Use for radiology (CT, MR, PET) or pathology datasets for AI/ML training or research.
Query and download public cancer imaging data (CT, MR, PET, pathology) from NCI Imaging Data Commons using idc-index. Supports metadata queries, DICOM downloads, browser visualization, and license checks.
Reads/writes DICOM medical images (CT, MRI, X-ray, ultrasound). Extracts pixels as NumPy arrays, edits tags, applies VOI LUT windowing, anonymizes PHI, builds 3D volumes from series.