From sciagent-skills
Queries NCI Imaging Data Commons (IDC) for cancer radiology and pathology DICOM/WSI datasets on Google Cloud. Searches by modality, site, or type; downloads via GCS/IDAT. Free BigQuery metadata queries.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
This skill uses the workspace's default tool permissions.
NCI Imaging Data Commons (IDC) is NCI's cloud-based repository for cancer imaging data, hosting 50+ TB of publicly accessible DICOM images spanning radiology (CT, MRI, PET) and pathology (whole slide images) across 100+ collections. All data is hosted on Google Cloud Storage and BigQuery, enabling SQL queries over DICOM metadata without downloading. IDC integrates with Google Colab and BigQuery, making large-scale imaging research accessible without local storage.
pydicom-medical-imaging; for WSI preprocessing use histolab
requests, pandas, google-cloud-bigquery (for SQL queries), idc-index
gcloud auth for downloads
pip install requests pandas idc-index google-cloud-bigquery
# For downloads: install Google Cloud SDK and authenticate
# gcloud auth application-default login
from idc_index import index
# List all available IDC collections
idc_client = index.IDCClient()
collections = idc_client.get_collections()
print(f"Available IDC collections: {len(collections)}")
print(collections[["collection_id", "cancer_type", "location", "species"]].head(10))
Browse available IDC collections and filter by cancer type or modality.
from idc_index import index
import pandas as pd
idc_client = index.IDCClient()
# Get all collections with metadata
collections = idc_client.get_collections()
print(f"Total IDC collections: {len(collections)}")
print("Columns:", list(collections.columns))
# Filter for lung-related collections (by cancer type or anatomical location)
lung_ct = collections[
    (collections["cancer_type"].str.contains("Lung", case=False, na=False)) |
    (collections["location"].str.contains("Lung", case=False, na=False))
]
print(f"\nLung-related collections: {len(lung_ct)}")
print(lung_ct[["collection_id", "cancer_type", "location"]].head())
# List modalities available in a collection
collection_id = "LIDC-IDRI" # Lung Image Database Consortium
series = idc_client.get_series(collection_id=collection_id)
modalities = series["Modality"].value_counts()
print(f"\nModalities in {collection_id}:")
print(modalities)
print(f"Total series: {len(series)}")
Query for specific imaging series within collections.
from idc_index import index
import pandas as pd
idc_client = index.IDCClient()
# Get all CT series for lung cancer
ct_series = idc_client.get_series(
    collection_id="LIDC-IDRI",
    modality="CT"
)
print(f"CT series in LIDC-IDRI: {len(ct_series)}")
print(ct_series[["PatientID", "StudyInstanceUID", "SeriesInstanceUID", "Modality"]].head())
# Query across all collections for specific modality + body part
all_series = idc_client.get_series() # All IDC series metadata
mr_brain = all_series[
    (all_series["Modality"] == "MR") &
    (all_series["BodyPartExamined"].str.contains("BRAIN", case=False, na=False))
]
print(f"Brain MRI series across IDC: {len(mr_brain)}")
print(mr_brain["collection_id"].value_counts().head(10))
Use BigQuery for scalable DICOM metadata queries across all IDC data.
from google.cloud import bigquery
import pandas as pd
# BigQuery client (requires Google Cloud authentication)
client = bigquery.Client(project="your-gcp-project-id")
# Count series and patients by modality and collection
# (the DICOM Modality code for PET is 'PT', not 'PET')
query = """
SELECT
  Modality,
  collection_id,
  COUNT(DISTINCT SeriesInstanceUID) AS num_series,
  COUNT(DISTINCT PatientID) AS num_patients
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality IN ('CT', 'MR', 'PT')
GROUP BY Modality, collection_id
ORDER BY num_series DESC
LIMIT 20
"""
df = client.query(query).to_dataframe()
print(df.to_string(index=False))
# Find CT studies in three lung collections (capped at 100 rows).
# This query does not filter on the presence of tumor segmentations.
query2 = """
SELECT DISTINCT
  d.PatientID,
  d.StudyInstanceUID,
  d.collection_id,
  d.Modality,
  d.BodyPartExamined
FROM `bigquery-public-data.idc_current.dicom_all` d
WHERE d.collection_id IN ('LIDC-IDRI', 'TCGA-LUAD', 'TCGA-LUSC')
  AND d.Modality = 'CT'
LIMIT 100
"""
df2 = client.query(query2).to_dataframe()
print(f"Lung CT studies: {len(df2)}")
print(df2.head())
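Queries against `dicom_all` can scan a lot of data, so it may be worth checking the scan size before running one. The sketch below assumes the dry-run mode of google-cloud-bigquery (`QueryJobConfig(dry_run=True)`) and a monthly free allowance of 1 TiB for on-demand queries. `estimate_query_bytes` and `fits_free_tier` are hypothetical helper names introduced here, not part of any IDC tooling.

```python
FREE_TIER_BYTES = 2**40  # assumed monthly free allowance for on-demand queries (1 TiB)


def estimate_query_bytes(sql: str, project: str) -> int:
    """Return the bytes a query would scan, using a BigQuery dry run."""
    # Imported lazily so fits_free_tier() works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def fits_free_tier(query_bytes: int, used_bytes: int = 0) -> bool:
    """True if the query still fits into the remaining free allowance."""
    return used_bytes + query_bytes <= FREE_TIER_BYTES


# Made-up numbers: a 250 GiB scan with 800 GiB already used this month
print(fits_free_tier(250 * 2**30, used_bytes=800 * 2**30))  # False
```

In practice, `estimate_query_bytes(query, "your-gcp-project-id")` would supply the first argument to `fits_free_tier`.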
Download DICOM series using the idc-index download utilities.
from idc_index import index
import os
idc_client = index.IDCClient()
# Download a specific series by SeriesInstanceUID
series_uid = "1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192"
output_dir = "./downloaded_dicom/"
os.makedirs(output_dir, exist_ok=True)
idc_client.download_dicom_series(
    seriesInstanceUID=series_uid,
    downloadDir=output_dir,
    quiet=False
)
print(f"Downloaded series to {output_dir}")
# List downloaded files
import glob
files = glob.glob(f"{output_dir}/**/*.dcm", recursive=True)
print(f"Downloaded {len(files)} DICOM files")
Get detailed statistics for a specific IDC collection.
from idc_index import index
import pandas as pd
idc_client = index.IDCClient()
collection_id = "TCGA-GBM" # Glioblastoma
series = idc_client.get_series(collection_id=collection_id)
print(f"Collection: {collection_id}")
print(f"Total series: {len(series)}")
print(f"Patients: {series['PatientID'].nunique()}")
print(f"Studies: {series['StudyInstanceUID'].nunique()}")
modality_summary = series.groupby("Modality")["SeriesInstanceUID"].count()
print(f"\nModalities:")
print(modality_summary)
if "BodyPartExamined" in series.columns:
    print(f"\nBody parts: {series['BodyPartExamined'].value_counts().head()}")
Use the IDC REST API for collection information without local client.
import requests, pandas as pd
IDC_API = "https://api.imaging.datacommons.cancer.gov/v1"
# Get all collections
r = requests.get(f"{IDC_API}/collections")
r.raise_for_status()
collections = r.json()["collections"]
print(f"IDC collections via REST API: {len(collections)}")
# Convert to DataFrame
df = pd.DataFrame(collections)
print(df.columns.tolist())
print(df[["collection_id", "cancer_type", "location"]].head(5).to_string(index=False))
# Get metadata for a specific collection
collection_id = "LIDC-IDRI"
r = requests.get(f"{IDC_API}/collections/{collection_id}")
if r.ok:
    data = r.json()
    print(f"Collection: {collection_id}")
    print(f" Description: {str(data)[:200]}")
IDC organizes data following the DICOM hierarchy: Collection → Patient → Study (StudyInstanceUID) → Series (SeriesInstanceUID) → Instance (SOPInstanceUID). Downloads are typically at the Series level.
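To make the nesting concrete, here is a minimal sketch that turns flat instance-level rows into the Collection → Patient → Study → Series → Instance tree. All identifiers are made up for illustration, and `build_hierarchy` is a hypothetical helper, not an IDC function.

```python
from collections import defaultdict

# (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID)
# Made-up identifiers, not real IDC data.
instances = [
    ("demo", "P1", "1.1", "1.1.1", "1.1.1.1"),
    ("demo", "P1", "1.1", "1.1.1", "1.1.1.2"),
    ("demo", "P1", "1.1", "1.1.2", "1.1.2.1"),
    ("demo", "P1", "1.2", "1.2.1", "1.2.1.1"),
    ("demo", "P2", "2.1", "2.1.1", "2.1.1.1"),
]


def build_hierarchy(records):
    """Nest flat rows as collection -> patient -> study -> series -> [instances]."""
    tree = defaultdict(
        lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    )
    for collection, patient, study, series, instance in records:
        tree[collection][patient][study][series].append(instance)
    return tree


tree = build_hierarchy(instances)
for patient, studies in tree["demo"].items():
    n_series = sum(len(series) for series in studies.values())
    print(patient, "studies:", len(studies), "series:", n_series)
```

Since downloads typically happen at the series level, the innermost lists correspond to the files that would land on disk for one download call.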
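All IDC images are stored in Google Cloud Storage (GCS) buckets such as gs://idc-open-data/. Objects are grouped into one folder per series, and that folder is named by the series' CRDC UUID (crdc_series_uuid in BigQuery) rather than by the DICOM SeriesInstanceUID. Files can be listed or copied with gsutil or the GCS Python client, and the gcs_url column of the BigQuery metadata tables links each DICOM instance to its exact GCS path.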
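GCS client libraries generally want a bucket name and an object path rather than a full gs:// URL. A minimal sketch of that split using only the standard library follows. `split_gcs_url` is a hypothetical helper, and the folder and file names in the example URL are placeholders, not real IDC objects.

```python
from urllib.parse import urlparse


def split_gcs_url(url: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder folder and file names
bucket, key = split_gcs_url(
    "gs://idc-open-data/example-series-folder/example-instance.dcm"
)
print(bucket)  # idc-open-data
print(key)     # example-series-folder/example-instance.dcm
```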
Goal: Select imaging series from IDC matching specific criteria and prepare download manifest.
from idc_index import index
import pandas as pd
idc_client = index.IDCClient()
# Step 1: Get all series across IDC
all_series = idc_client.get_series()
print(f"Total IDC series: {len(all_series)}")
# Step 2: Filter for CT scans in specific cancer types
target_collections = ["LIDC-IDRI", "TCGA-LUAD", "TCGA-LUSC"]
ct_lung = all_series[
    (all_series["collection_id"].isin(target_collections)) &
    (all_series["Modality"] == "CT")
].copy()
print(f"CT lung series: {len(ct_lung)}")
print(ct_lung.groupby("collection_id")["SeriesInstanceUID"].count())
# Step 3: Draw a reproducible random sample (uniform over series, not balanced per collection)
sample_size = 100
ct_sampled = ct_lung.sample(min(sample_size, len(ct_lung)), random_state=42)
# Step 4: Save manifest
ct_sampled[["SeriesInstanceUID", "PatientID", "collection_id", "Modality"]].to_csv(
    "ct_lung_manifest.csv", index=False
)
print(f"\nSaved manifest with {len(ct_sampled)} series → ct_lung_manifest.csv")
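The `sample` call in Step 3 draws uniformly over all rows, so large collections dominate the result. If roughly equal representation per collection is wanted, one option is to sample within each collection. The sketch below uses only the standard library and made-up rows; `sample_per_group` is a hypothetical helper.

```python
import random
from collections import defaultdict


def sample_per_group(rows, key, n_per_group, seed=42):
    """Draw up to n_per_group rows from each group defined by row[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    rng = random.Random(seed)
    sampled = []
    for name in sorted(groups):  # sorted so the draw order is reproducible
        members = groups[name]
        sampled.extend(rng.sample(members, min(n_per_group, len(members))))
    return sampled


# Made-up manifest rows: 50 series in one collection, 3 in another
rows = [{"collection_id": "A", "SeriesInstanceUID": f"a{i}"} for i in range(50)]
rows += [{"collection_id": "B", "SeriesInstanceUID": f"b{i}"} for i in range(3)]

balanced = sample_per_group(rows, key="collection_id", n_per_group=5)
print(len(balanced))  # 8: five from A plus all three from B
```

Groups smaller than `n_per_group` contribute all of their rows, so the result is balanced only up to the size of the smallest collection.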
Goal: Download a small sample from a collection for exploratory analysis.
from idc_index import index
import pydicom
import glob
import os
idc_client = index.IDCClient()
# Get series from a collection
series = idc_client.get_series(collection_id="TCGA-GBM", modality="MR")
print(f"Brain MRI series: {len(series)}")
# Download first 2 series for exploration
sample_series = series["SeriesInstanceUID"].iloc[:2].tolist()
output_dir = "./tcga_gbm_sample/"
os.makedirs(output_dir, exist_ok=True)
for uid in sample_series:
    print(f"Downloading series: {uid[:30]}...")
    idc_client.download_dicom_series(
        seriesInstanceUID=uid,
        downloadDir=output_dir,
        quiet=True
    )
# Inspect one downloaded DICOM file
dcm_files = glob.glob(f"{output_dir}/**/*.dcm", recursive=True)
if dcm_files:
    ds = pydicom.dcmread(dcm_files[0])
    print(f"\nDICOM file info:")
    print(f" Patient ID : {ds.PatientID}")
    print(f" Modality : {ds.Modality}")
    print(f" Rows x Cols : {ds.Rows} x {ds.Columns}")
    print(f" Pixel array : {ds.pixel_array.shape}")
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `collection_id` | get_series/download | none | IDC collection name | Filter by specific collection |
| `modality` | get_series | none | "CT", "MR", "PT" (PET), "SM" (slide microscopy) | Filter by imaging modality |
| `seriesInstanceUID` | download | required | DICOM UID string | Specific series to download |
| `downloadDir` | download | required | directory path | Local directory for DICOM files |
| `quiet` | download | False | True/False | Suppress download progress output |
| BigQuery dataset | BigQuery | idc_current | idc_current, idc_v17 | IDC BigQuery version |
Use idc-index for metadata filtering first: Always query metadata (series, modality, patient count) before downloading. Downloads can be large (GB to TB per collection).
Use BigQuery for complex cross-collection queries: The IDC BigQuery tables (bigquery-public-data.idc_current.dicom_all) enable SQL-based filtering over the metadata of all 50+ TB of images without any downloads.
Download only needed series: IDC series can range from 1 MB to 5 GB; always save a manifest CSV of SeriesInstanceUIDs to download rather than bulk-downloading entire collections.
Version-lock your datasets: IDC releases versioned datasets (v14, v15…). Always note the IDC version used (idc_client.get_idc_version()) for reproducibility.
Use Google Cloud credits: IDC BigQuery queries and GCS egress are free within certain limits; for large downloads, consider requesting Google Cloud for Researchers credits.
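One way to act on the version-locking advice is to store the IDC release next to every manifest. The sketch below writes a small JSON sidecar file. `write_provenance`, the sidecar naming scheme, and the record fields are assumptions made for illustration, and "v17" is a placeholder for whatever `idc_client.get_idc_version()` reports.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(manifest_path, idc_version):
    """Write a JSON sidecar recording which IDC release a manifest came from."""
    manifest_path = Path(manifest_path)
    out_path = manifest_path.with_suffix(".provenance.json")
    record = {
        "manifest": manifest_path.name,
        "idc_version": str(idc_version),
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    out_path.write_text(json.dumps(record, indent=2))
    return out_path


# Demonstrated in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_provenance(Path(tmp) / "ct_lung_manifest.csv", "v17")
    print(sidecar.name)  # ct_lung_manifest.provenance.json
    print(json.loads(sidecar.read_text())["idc_version"])  # v17
```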
When to use: Identify all IDC collections covering a specific cancer type.
from idc_index import index
idc_client = index.IDCClient()
collections = idc_client.get_collections()
glioma = collections[collections["cancer_type"].str.contains("Glioma|Glioblastoma", case=False, na=False)]
print(glioma[["collection_id", "cancer_type", "location"]].to_string(index=False))
When to use: Find patients with multimodal imaging for fusion or cross-modal studies.
from idc_index import index
import pandas as pd
idc_client = index.IDCClient()
series = idc_client.get_series(collection_id="TCGA-GBM")
ct_patients = set(series[series["Modality"] == "CT"]["PatientID"])
mr_patients = set(series[series["Modality"] == "MR"]["PatientID"])
multimodal = ct_patients & mr_patients
print(f"Patients with both CT and MRI: {len(multimodal)}")
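The two-set intersection above generalizes to any combination of modalities. A minimal sketch with made-up series rows follows; `patients_with_modalities` is a hypothetical helper.

```python
from collections import defaultdict


def patients_with_modalities(rows, required):
    """Return PatientIDs that have at least one series of every required modality."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["PatientID"]].add(row["Modality"])
    required = set(required)
    return sorted(pid for pid, mods in seen.items() if required <= mods)


# Made-up series rows for illustration
rows = [
    {"PatientID": "P1", "Modality": "CT"},
    {"PatientID": "P1", "Modality": "MR"},
    {"PatientID": "P2", "Modality": "MR"},
    {"PatientID": "P3", "Modality": "CT"},
    {"PatientID": "P3", "Modality": "MR"},
    {"PatientID": "P3", "Modality": "PT"},
]
print(patients_with_modalities(rows, {"CT", "MR"}))        # ['P1', 'P3']
print(patients_with_modalities(rows, {"CT", "MR", "PT"}))  # ['P3']
```

With a pandas DataFrame of series, the rows could be supplied as `series.to_dict("records")`.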
When to use: Construct Google Cloud Storage URL to access DICOM files directly.
from idc_index import index
idc_client = index.IDCClient()
series = idc_client.get_series(collection_id="LIDC-IDRI", modality="CT")
uid = series["SeriesInstanceUID"].iloc[0]
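# Path pattern as given in this guide; IDC may key folders by crdc_series_uuid instead,
# so confirm the exact path via the gcs_url column in BigQuery before relying on it
gcs_path = f"gs://idc-open-data/{uid}/"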
print(f"GCS path: {gcs_path}")
print(f"Access with: gsutil ls {gcs_path}")
| Problem | Cause | Solution |
|---|---|---|
| `idc_index` authentication error | Not authenticated with GCP | Run `gcloud auth application-default login` |
| BigQuery billing error | No billing account linked | Link billing in GCP console; first 1 TB/month of queries is free |
| Download very slow | GCS egress charges or throttling | Use Google Compute Engine or Colab (co-located with GCS) |
| SeriesInstanceUID not found | Local index predates the series (old idc-index version) | Upgrade the package (`pip install --upgrade idc-index`) so the bundled index matches the current IDC release |
| pydicom can't read downloaded file | Corrupted download | Re-download; verify file size matches expected |
| Empty modality filter result | Modality string capitalization | Use uppercase modality codes: "CT" not "ct" |
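To avoid the empty-filter problem in the last row, user input can be normalized to DICOM Modality codes before filtering. The alias table below is a small illustrative subset chosen for this sketch, and `normalize_modality` is a hypothetical helper.

```python
# Informal names mapped to DICOM Modality codes (illustrative subset only)
MODALITY_ALIASES = {
    "MRI": "MR",
    "PET": "PT",
    "ULTRASOUND": "US",
    "WSI": "SM",
}


def normalize_modality(value: str) -> str:
    """Uppercase and trim a modality string, then resolve known aliases."""
    code = value.strip().upper()
    return MODALITY_ALIASES.get(code, code)


for raw in ["ct", " Mr ", "pet", "wsi"]:
    print(repr(raw), "->", normalize_modality(raw))
```

Unknown values pass through uppercased, so a typo still yields an empty result rather than a silent substitution.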
pydicom-medical-imaging: Local DICOM file processing for downloaded IDC images
histolab-wsi-processing: Whole slide image preprocessing for IDC pathology (SM modality) series
pathml: ML pipeline for computational pathology using IDC slide data