From sciagent-skills
Analyzes single-cell RNA-seq data with Scanpy: QC, normalization, HVG selection, PCA/UMAP/t-SNE, Leiden clustering, markers, annotation, trajectory inference. For count matrices in h5ad/10X.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsThis skill uses the workspace's default tool permissions.
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.
Analyzes single-cell RNA-seq data using Scanpy: loads .h5ad/10X/CSV, QC, normalization, PCA/UMAP/t-SNE, Leiden clustering, marker genes, cell annotation, trajectories.
Analyzes single-cell RNA-seq data with Scanpy: QC, normalization, dimensionality reduction (PCA/UMAP/t-SNE), clustering, marker genes, visualization, trajectory inference. For exploratory workflows.
Analyzes single-cell RNA-seq data with Scanpy: quality control, normalization, dimensionality reduction, clustering, marker genes, visualization, trajectory inference. Supports .h5ad, 10X, CSV.
Share bugs, ideas, or general feedback.
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.
scanpy>=1.10, leidenalg, igraph, anndatafiltered_feature_bc_matrix/), .h5ad, .h5, .csvpip install "scanpy[leiden]" anndata
Configure scanpy settings and read the count matrix into an AnnData object. Ensure gene names are unique.
import scanpy as sc
import numpy as np
import pandas as pd
# Configure settings
sc.settings.verbosity = 3 # errors=0, warnings=1, info=2, hints=3
sc.settings.set_figure_params(dpi=80, facecolor="white")
# Load data — pick the reader matching your format
adata = sc.read_10x_mtx(
"path/to/filtered_feature_bc_matrix/",
var_names="gene_symbols",
cache=True,
)
# Alternatives:
# adata = sc.read_h5ad("data.h5ad")
# adata = sc.read_10x_h5("data.h5")
adata.var_names_make_unique()
print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Sparsity: {1 - adata.X.nnz / (adata.n_obs * adata.n_vars):.1%}")
Annotate mitochondrial/ribosomal genes, compute QC metrics, visualize distributions, and filter low-quality cells.
# Annotate gene groups
adata.var["mt"] = adata.var_names.str.startswith("MT-") # human; use "mt-" for mouse
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# Calculate QC metrics
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)
# Visualize QC distributions
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
# Filter — adjust thresholds based on the QC plots above
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 5000, :] # remove potential doublets
adata = adata[adata.obs.pct_counts_mt < 20, :] # remove dying cells
print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
Normalize library sizes, log-transform, and identify highly variable genes (HVGs). Store raw counts for later differential expression.
# Preserve raw counts in a layer
adata.layers["counts"] = adata.X.copy()
# Normalize to 10,000 counts per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Save normalized data as .raw for visualization
adata.raw = adata
# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")
sc.pl.highly_variable_genes(adata)
# Subset to HVGs
adata = adata[:, adata.var.highly_variable]
print(f"HVG subset: {adata.n_obs} cells x {adata.n_vars} genes")
Regress out confounders and scale gene expression values. This prepares data for PCA.
# Regress out unwanted sources of variation
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
# Scale to unit variance, clip extreme values
sc.pp.scale(adata, max_value=10)
print(f"Scaled matrix: mean={adata.X.mean():.4f}, std={adata.X.std():.4f}")
Compute PCA, inspect the variance ratio elbow plot, then build a neighborhood graph and embed with UMAP.
# PCA
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
# Determine n_pcs from elbow plot (typically 30-50)
n_pcs = 40
# Compute k-nearest neighbor graph
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs)
# UMAP embedding
sc.tl.umap(adata)
sc.pl.umap(adata, color=["n_genes_by_counts", "total_counts", "pct_counts_mt"])
print(f"UMAP shape: {adata.obsm['X_umap'].shape}")
Run Leiden clustering at multiple resolutions and select the best granularity by visual inspection.
# Test multiple resolutions
for res in [0.3, 0.5, 0.8, 1.0, 1.2]:
sc.tl.leiden(adata, resolution=res, key_added=f"leiden_res{res}", flavor="igraph", n_iterations=2)
# Compare resolutions side-by-side
sc.pl.umap(
adata,
color=["leiden_res0.3", "leiden_res0.5", "leiden_res0.8", "leiden_res1.0"],
ncols=2,
legend_loc="on data",
)
# Pick final resolution based on biological plausibility
adata.obs["leiden"] = adata.obs["leiden_res0.5"]
n_clusters = adata.obs["leiden"].nunique()
print(f"Selected resolution=0.5: {n_clusters} clusters")
Find differentially expressed genes per cluster using Wilcoxon rank-sum test on raw counts.
# Rank genes per cluster
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", use_raw=True)
# Visualization
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, use_raw=True, show_gene_labels=True)
# Extract as DataFrame
markers_df = sc.get.rank_genes_groups_df(adata, group=None)
top_markers = markers_df[markers_df["pvals_adj"] < 0.05].groupby("group").head(10)
print(f"Significant markers: {len(top_markers)} genes across {top_markers['group'].nunique()} clusters")
print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]].head(15))
Assign cell types based on canonical marker genes, visualize, and save all results.
# Define canonical markers (example: human PBMC)
marker_genes = {
"CD4 T cells": ["CD3D", "CD3E", "IL7R"],
"CD8 T cells": ["CD8A", "CD8B", "GZMK"],
"B cells": ["MS4A1", "CD79A", "CD79B"],
"NK cells": ["NKG7", "GNLY", "KLRD1"],
"CD14+ Monocytes": ["CD14", "LYZ", "S100A9"],
"FCGR3A+ Monocytes":["FCGR3A", "MS4A7"],
"Dendritic cells": ["FCER1A", "CST3"],
"Platelets": ["PPBP", "PF4"],
}
# Dot plot — rows = clusters, columns = markers
sc.pl.dotplot(adata, var_names=marker_genes, groupby="leiden", use_raw=True)
# Map clusters → cell types (adjust based on your dot plot)
cluster_to_celltype = {
"0": "CD4 T cells",
"1": "CD14+ Monocytes",
"2": "B cells",
"3": "CD8 T cells",
"4": "NK cells",
"5": "FCGR3A+ Monocytes",
"6": "Dendritic cells",
"7": "Platelets",
}
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).fillna("Unknown")
sc.pl.umap(adata, color="cell_type", legend_loc="on data", frameon=False)
# Save results
adata.write("results/annotated_data.h5ad", compression="gzip")
adata.obs.to_csv("results/cell_metadata.csv")
markers_df.to_csv("results/marker_genes.csv", index=False)
print(f"Saved: {adata.n_obs} cells with {adata.obs['cell_type'].nunique()} cell types")
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
min_genes (filter_cells) | 200 | 100-500 | Minimum genes detected per cell; lower keeps more cells |
min_cells (filter_genes) | 3 | 3-10 | Minimum cells expressing a gene; higher is stricter |
pct_counts_mt threshold | 20 | 5-30 | Max mitochondrial %; tissue-dependent (5% for PBMCs, 20% for tumors) |
n_top_genes (HVG) | 2000 | 1000-5000 | Number of highly variable genes; more captures subtle variation |
flavor (HVG) | "seurat_v3" | "seurat", "seurat_v3", "cell_ranger" | HVG detection algorithm |
n_pcs (neighbors) | 40 | 20-50 | Number of PCs for neighbor graph; check elbow plot |
n_neighbors | 15 | 5-50 | k for k-NN graph; lower emphasizes local structure |
resolution (leiden) | 0.5 | 0.1-2.0 | Clustering granularity; higher → more clusters |
method (rank_genes) | "wilcoxon" | "wilcoxon", "t-test", "logreg" | DE test; Wilcoxon recommended for publication |
target_sum (normalize) | 1e4 | 1e4-1e5 | Target library size per cell |
When to use: multiple samples/batches with visible batch effects on the UMAP.
# Requires 'batch' column in adata.obs
sc.pp.combat(adata, key="batch")
# Re-run PCA, neighbors, UMAP, clustering after correction
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["batch", "leiden"])
When to use: cells follow a developmental continuum rather than discrete clusters.
# PAGA graph abstraction
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, color="leiden", threshold=0.03)
# Reinitialize UMAP with PAGA layout
sc.tl.umap(adata, init_pos="paga")
# Diffusion pseudotime — set root cell
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == "0")[0]
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["dpt_pseudotime", "leiden"])
When to use: generating figures for manuscripts or presentations.
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5))
# UMAP with custom palette
sc.pl.umap(
adata, color="cell_type",
palette="Set2",
legend_loc="on data",
legend_fontsize=10,
legend_fontoutline=2,
title="",
save="_celltype_publication.pdf",
)
# Stacked violin plot of top markers
flat_markers = [g for genes in marker_genes.values() for g in genes[:2]]
sc.pl.stacked_violin(
adata, var_names=flat_markers, groupby="cell_type",
use_raw=True, swap_axes=True,
save="_markers_violin.pdf",
)
When to use: comparing treated vs control within a specific cell type.
# Subset to cell type of interest
adata_t = adata[adata.obs["cell_type"] == "CD4 T cells"].copy()
# DE between conditions
sc.tl.rank_genes_groups(adata_t, groupby="condition", groups=["treated"], reference="control", method="wilcoxon", use_raw=True)
sc.pl.rank_genes_groups_volcano(adata_t, groups=["treated"])
de_results = sc.get.rank_genes_groups_df(adata_t, group="treated")
sig_genes = de_results[de_results["pvals_adj"] < 0.05]
print(f"Significant DE genes: {len(sig_genes)}")
results/annotated_data.h5ad — Full AnnData with embeddings (PCA, UMAP), cluster labels, cell type annotations, and raw counts layerresults/cell_metadata.csv — Per-cell table: barcode, cluster, cell type, QC metrics (n_genes, total_counts, pct_mt)results/marker_genes.csv — DE genes per cluster: gene name, log fold change, adjusted p-value| Problem | Cause | Solution |
|---|---|---|
ModuleNotFoundError: leidenalg | Leiden package not installed | pip install leidenalg igraph |
MemoryError during PCA | Dense matrix exceeds RAM | Use sc.pp.pca(adata, chunked=True, chunk_size=20000) |
| UMAP shows single blob | Over-filtering or too few HVGs | Relax QC thresholds; increase n_top_genes to 3000-5000 |
| Too many/few clusters | Resolution mismatch | Sweep resolution from 0.1 to 2.0 and compare UMAPs |
KeyError: 'MT-' not found | Wrong species prefix | Use mt- (lowercase) for mouse, MT- for human |
| Batch effects dominate UMAP | Uncorrected technical variation | Apply ComBat recipe or use Harmony/scVI for integration |
adata.raw is None error | Forgot to save raw before subsetting | Set adata.raw = adata after normalization, before HVG subset |
| Marker genes non-specific | Resolution too low merging cell types | Increase resolution or use logistic_regression method |
Slow regress_out | Large cell count | Skip regress_out; use sc.pp.scale() alone (often sufficient) |
ValueError in rank_genes_groups | Cluster with too few cells | Remove clusters with <10 cells before DE: adata = adata[adata.obs.groupby('leiden').filter(lambda x: len(x)>=10).index] |
This skill includes reference files for deeper lookup. Read these on demand when the main SKILL.md needs more detail.
Quick-lookup table of all scanpy functions organized by module (sc.pp., sc.tl., sc.pl.*), with full signatures, common parameters, and AnnData structure reference. Use when: you need the exact function name or parameter for a specific operation.
Comprehensive visualization guide: QC plots, UMAP styling, marker gene plots (dot plot, heatmap, stacked violin), trajectory plots, multi-panel figures, publication customization, and color palette recommendations. Use when: creating figures for publications or presentations.
Detailed step-by-step reference for each pipeline stage with decision points, parameter rationale, and alternative approaches (MAD-based filtering, scran normalization, automated annotation). Use when: performing a full analysis from scratch and needing deeper guidance than the Workflow section above.