From sciagent-skills
Annotates scRNA-seq query datasets with consensus cell types from labeled references using 10+ algorithms (KNN-Harmony, CellTypist, scVI, etc.) via majority voting. Outputs per-method labels and agreement scores for uncertainty.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
This skill uses the workspace's default tool permissions.
popV (Population Voting for single-cell annotation) annotates a query scRNA-seq dataset by running 10+ independent classification algorithms against a labeled reference atlas and aggregating results via majority voting. Each method produces its own label; the final popv_prediction is the consensus across all methods, and the popv_agreement score quantifies how many methods agree. This ensemble strategy is robust to individual method failures on unusual datasets and provides a principled uncertainty estimate: low agreement highlights novel cell states or annotation gaps.
popv_agreement score)
popv>=0.6, scanpy>=1.9, anndata, scvi-tools>=1.0, harmonypy, bbknn, celltypist
A labeled reference (adata_ref) with cell type labels in obs, and an unlabeled query (adata_query). Both must be from the same species and have overlapping gene sets. Raw counts in adata.X (popV applies its own normalization internally)
pip install popv scvi-tools harmonypy bbknn celltypist
Minimal pipeline from labeled reference and unlabeled query to annotated result:
import popv
import scanpy as sc
# Load reference (labeled) and query (unlabeled) AnnData objects
adata_ref = sc.read_h5ad("reference_atlas.h5ad") # adata_ref.obs["cell_type"] exists
adata_query = sc.read_h5ad("query_dataset.h5ad")
# Prepare combined object with popV preprocessing
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=50,
)
# Run all annotation methods
popv.annotation.annotate_data(adata)
# Inspect consensus results for query cells
query_mask = adata.obs["_dataset"] == "query"
print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].head(10))
Both AnnData objects must share a gene space and have required metadata columns. popV will subset to the intersection of genes automatically.
import anndata as ad
import scanpy as sc
import numpy as np
# Reference: must have cell type labels and (optionally) batch metadata
adata_ref = sc.read_h5ad("reference_atlas.h5ad")
print(f"Reference: {adata_ref.n_obs} cells x {adata_ref.n_vars} genes")
print(f"Cell types: {adata_ref.obs['cell_type'].nunique()} unique labels")
print(f"Reference cell type counts:\n{adata_ref.obs['cell_type'].value_counts().head(10)}")
# Query: no labels required; batch metadata optional
adata_query = sc.read_h5ad("query_dataset.h5ad")
print(f"\nQuery: {adata_query.n_obs} cells x {adata_query.n_vars} genes")
# Check gene overlap (popV will handle subsetting but >70% overlap is recommended)
shared_genes = adata_ref.var_names.intersection(adata_query.var_names)
pct_shared = len(shared_genes) / adata_ref.n_vars
print(f"\nShared genes: {len(shared_genes)} ({pct_shared:.1%} of reference genes)")
if pct_shared < 0.5:
    print("WARNING: <50% gene overlap; annotation quality may be reduced")
# Verify required fields before popV setup
assert "cell_type" in adata_ref.obs.columns, "Reference needs cell type labels"
# Add batch column if absent (popV requires it even for single-batch data)
if "batch" not in adata_ref.obs.columns:
    adata_ref.obs["batch"] = "ref_batch"
if "batch" not in adata_query.obs.columns:
    adata_query.obs["batch"] = "query_batch"
print("Reference obs columns:", adata_ref.obs.columns.tolist())
print("Query obs columns: ", adata_query.obs.columns.tolist())
Process_Query combines reference and query, normalizes counts, selects HVGs, and prepares the joint embedding needed by all annotation methods.
import popv
# Create processed combined AnnData
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",                 # obs column with reference labels
    ref_batch_key="batch",                      # obs column with reference batch info
    query_batch_key="batch",                    # obs column with query batch info
    unknown_celltype_label="unknown",           # label to use for query cells before annotation
    save_path_trained_models="./popv_models/",  # directory for scVI/SCANVI model checkpoints
    n_epochs_unsupervised=50,                   # scVI training epochs (increase to 100–200 for large datasets)
    n_epochs_semisupervised=20,                 # scANVI fine-tuning epochs
    use_gpu=True,                               # GPU for scVI/SCANVI (falls back to CPU if unavailable)
    hvg=4000,                                   # number of highly variable genes to use
)
print(f"Combined object: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Dataset labels: {adata.obs['_dataset'].value_counts().to_dict()}")
# Expected: {'ref': N_ref, 'query': N_query}
annotate_data runs all selected methods sequentially and adds per-method label columns plus the consensus to adata.obs.
import popv
# Run annotation with default set of methods
popv.annotation.annotate_data(
    adata,
    methods=[
        "knn_harmony",      # KNN on Harmony-corrected embedding
        "knn_bbknn",        # KNN on BBKNN cross-batch graph
        "knn_scvi",         # KNN on scVI latent space
        "scanvi_popv",      # Semi-supervised scANVI label transfer
        "celltypist_popv",  # CellTypist logistic regression
        "rf",               # Random Forest on HVG expression
        "xgboost",          # XGBoost classifier
        "svm",              # Support Vector Machine
        "onclass",          # ONCLASS (ontology-guided)
    ],
)
# Inspect per-method result columns (all end in "_popv")
query_mask = adata.obs["_dataset"] == "query"
popv_cols = adata.obs.filter(like="_popv").columns.tolist()
print(f"Per-method columns: {popv_cols}")
print(adata[query_mask].obs[popv_cols + ["popv_prediction", "popv_agreement"]].head(10))
popv_prediction is the majority-vote consensus; popv_agreement is the fraction of methods that agreed on the winning label.
import pandas as pd
query_mask = adata.obs["_dataset"] == "query"
query_obs = adata[query_mask].obs.copy()
# Consensus label distribution
print("Consensus cell type distribution:")
print(query_obs["popv_prediction"].value_counts().head(15))
# Agreement score statistics
print(f"\npopv_agreement statistics:")
print(query_obs["popv_agreement"].describe())
# agreement = 1.0 → all methods agree; agreement = 0.2 → only 2/10 methods agree
# Cells with high confidence (>80% method agreement)
high_conf = query_obs["popv_agreement"] >= 0.8
print(f"\nHigh-confidence cells (agreement >= 0.8): {high_conf.sum()} ({high_conf.mean():.1%})")
# Cells with low confidence — candidate novel states or annotation gaps
low_conf = query_obs["popv_agreement"] < 0.5
print(f"Low-confidence cells (agreement < 0.5): {low_conf.sum()} ({low_conf.mean():.1%})")
popV provides built-in UMAP and heatmap visualization of per-method agreement and consensus labels.
import popv
import scanpy as sc
import matplotlib.pyplot as plt
# Compute UMAP on the joint reference+query embedding (if not already present)
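if "X_umap" not in adata.obsm:
    # sc.tl.umap needs a neighbor graph; build one if it is missing
    if "neighbors" not in adata.uns:
        sc.pp.neighbors(adata)
    sc.tl.umap(adata)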
# popV built-in visualization: UMAP panel showing consensus + agreement
popv.visualization.predict_celltypes_umap(
    adata,
    save="popv_annotation_umap.png",
)
print("Saved popv_annotation_umap.png")
# Custom UMAP panels
fig, axes = plt.subplots(1, 3, figsize=(21, 6))
sc.pl.umap(adata, color="popv_prediction", ax=axes[0],
           title="popV Consensus", legend_loc="on data",
           legend_fontsize=6, show=False)
sc.pl.umap(adata, color="popv_agreement", ax=axes[1],
           cmap="RdYlGn", vmin=0, vmax=1,
           title="Method Agreement Score", show=False)
sc.pl.umap(adata, color="_dataset", ax=axes[2],
           title="Reference vs Query", show=False)
plt.tight_layout()
plt.savefig("popv_custom_umap.png", dpi=150, bbox_inches="tight")
print("Saved popv_custom_umap.png")
popV runs each method independently; the final prediction is determined by plurality vote across all methods. The popv_agreement score equals the fraction of methods that voted for the winning label (e.g., 0.7 = 7/10 methods agreed). This design has several properties:
| Method | Batch Correction | Speed | Best For |
|---|---|---|---|
| knn_harmony | Harmony | Fast | Moderate batch effects, large datasets |
| knn_bbknn | BBKNN | Fast | Diverse multi-tissue references |
| knn_scanorama | Scanorama | Fast | Multiple heterogeneous batches |
| knn_scvi | scVI VAE | Medium | Complex batch effects, probabilistic embedding |
| scanvi_popv | scVI+labels | Slow | Semi-supervised; most accurate when reference is clean |
| celltypist_popv | None (logistic) | Fast | Immune cells; works well without batch correction |
| rf | None | Medium | Balanced class distributions; interpretable feature importance |
| xgboost | None | Medium | High-confidence predictions on well-separated cell types |
| svm | None | Medium | High-dimensional gene expression; linear boundaries |
| onclass | None | Medium | Ontology-aware; handles unseen cell types via CL ontology |
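As a rough illustration of the plurality vote and agreement score described in this section, the sketch below recomputes both from a toy table of per-method labels using only pandas. The function name plurality_vote, the toy column names, and the tie-breaking behavior (whichever label value_counts lists first) are assumptions made for illustration; this is not popV's internal implementation.

```python
import pandas as pd


def plurality_vote(per_method: pd.DataFrame) -> pd.DataFrame:
    """Return the most frequent label per cell and the fraction of methods voting for it."""
    consensus, agreement = [], []
    for _, row in per_method.iterrows():
        votes = row.value_counts()               # labels sorted by vote count, descending
        consensus.append(votes.index[0])         # winning label (ties resolved arbitrarily here)
        agreement.append(votes.iloc[0] / len(row))
    return pd.DataFrame(
        {"popv_prediction": consensus, "popv_agreement": agreement},
        index=per_method.index,
    )


# Toy per-method labels for three cells and five methods (illustrative column names)
per_method = pd.DataFrame(
    {
        "knn_harmony_popv": ["T cell", "B cell", "NK cell"],
        "scanvi_popv":      ["T cell", "B cell", "NK cell"],
        "rf_popv":          ["T cell", "B cell", "T cell"],
        "xgboost_popv":     ["T cell", "NK cell", "B cell"],
        "svm_popv":         ["T cell", "T cell", "monocyte"],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

result = plurality_vote(per_method)
print(result)
```

In this toy table the third cell gets a winning label backed by only two of five methods, which is the kind of low-agreement case the text suggests reviewing manually.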
ONCLASS uses the Cell Ontology (CL) to represent cell types as nodes in a knowledge graph and predict unseen cell types by propagating similarity through the ontology. Unlike other methods, ONCLASS can predict a cell type that was not present in the training reference if it is ontologically adjacent to known types. Enable it by including "onclass" in the methods list.
popV annotation quality scales directly with reference quality:
Goal: Annotate an unlabeled query dataset using a curated reference atlas end-to-end.
import popv
import scanpy as sc
import pandas as pd
# 1. Load data
adata_ref = sc.read_h5ad("reference_atlas.h5ad") # has obs["cell_type"] and obs["batch"]
adata_query = sc.read_h5ad("query_dataset.h5ad") # no cell type labels
if "batch" not in adata_query.obs.columns:
adata_query.obs["batch"] = "query"
# 2. Preprocess: build joint normalized object
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=100,
    n_epochs_semisupervised=30,
    use_gpu=True,
    hvg=4000,
)
print(f"Prepared: {adata.n_obs} total cells")
# 3. Run ensemble annotation
popv.annotation.annotate_data(adata)
# 4. Extract query results
query_mask = adata.obs["_dataset"] == "query"
query_annotations = adata[query_mask].obs[[
    "popv_prediction", "popv_agreement",
    "knn_harmony_popv", "scanvi_popv", "rf_popv", "xgboost_popv",
]].copy()
# 5. Transfer back to original query object
adata_query.obs = adata_query.obs.join(
    query_annotations, how="left"
)
print(f"Annotated {query_mask.sum()} query cells")
print(query_annotations["popv_prediction"].value_counts().head(10))
# 6. Save annotated query
adata_query.write_h5ad("annotated_query.h5ad", compression="gzip")
query_annotations.to_csv("popv_annotations.csv")
print("Saved annotated_query.h5ad and popv_annotations.csv")
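The join in step 5 aligns rows by cell barcode (the obs index), so it only transfers labels when both tables use identical index values. The sketch below, using toy pandas tables, shows how a left join silently produces missing values when the index differs and one way to detect and repair that. The helper name transfer_annotations and the "-query" suffix are hypothetical; whether the combined object actually alters barcodes in a given popV version is an assumption to verify on your own data.

```python
import pandas as pd


def transfer_annotations(query_obs, annotations, suffix=None):
    """Left-join annotations onto query_obs and count cells left without a label."""
    ann = annotations.copy()
    if suffix:
        # Strip a (hypothetical) suffix appended to barcodes during concatenation
        ann.index = ann.index.str.removesuffix(suffix)
    merged = query_obs.join(ann, how="left")
    n_missing = int(merged[ann.columns[0]].isna().sum())
    return merged, n_missing


# Toy query metadata and annotations whose barcodes carry an extra suffix
query_obs = pd.DataFrame({"batch": ["q1", "q1", "q2"]},
                         index=["AAAC-1", "AAAG-1", "AAAT-1"])
annotations = pd.DataFrame(
    {"popv_prediction": ["T cell", "B cell", "T cell"],
     "popv_agreement": [0.9, 0.6, 1.0]},
    index=["AAAC-1-query", "AAAG-1-query", "AAAT-1-query"],
)

merged_raw, missing_raw = transfer_annotations(query_obs, annotations)
merged_fixed, missing_fixed = transfer_annotations(query_obs, annotations, suffix="-query")
print(f"Unmatched cells without fix: {missing_raw}, with fix: {missing_fixed}")
```

Checking the missing count right after the join is a cheap guard before writing the annotated file.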
Goal: Separate high-confidence annotations from ambiguous cells; flag candidate novel or transitional states for manual review.
import popv
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
# Assume adata has been annotated (as in Workflow 1)
query_mask = adata.obs["_dataset"] == "query"
query_obs = adata[query_mask].obs.copy()
# Tier cells by agreement score
bins = [0.0, 0.5, 0.8, 1.01]
labels = ["low (<0.5)", "medium (0.5–0.8)", "high (≥0.8)"]
query_obs["confidence_tier"] = pd.cut(
    query_obs["popv_agreement"], bins=bins, labels=labels, right=False
)
print("Cells per confidence tier:")
print(query_obs["confidence_tier"].value_counts())
# High-confidence subset: use popv_prediction directly
high_conf_mask = query_obs["popv_agreement"] >= 0.8
print(f"\nHigh-confidence annotations ({high_conf_mask.mean():.1%} of query cells):")
print(query_obs[high_conf_mask]["popv_prediction"].value_counts().head(10))
# Low-confidence subset: inspect per-method disagreement
low_conf = query_obs[query_obs["popv_agreement"] < 0.5]
popv_method_cols = [
    c for c in query_obs.columns
    if c.endswith("_popv") and c not in ("popv_prediction", "popv_agreement")
]
print(f"\nLow-confidence cells sample (showing per-method labels):")
print(low_conf[popv_method_cols + ["popv_prediction"]].head(10).to_string())
# Visualize agreement distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
query_obs["popv_agreement"].hist(bins=20, ax=axes[0], color="steelblue", edgecolor="white")
axes[0].axvline(0.8, color="red", linestyle="--", label="High-confidence threshold")
axes[0].set_xlabel("Method Agreement Score")
axes[0].set_ylabel("Cell Count")
axes[0].set_title("popV Agreement Distribution")
axes[0].legend()
query_obs["confidence_tier"].value_counts().plot.bar(ax=axes[1], color="steelblue")
axes[1].set_title("Cells by Confidence Tier")
axes[1].set_xlabel("Confidence Tier")
axes[1].set_ylabel("Cell Count")
plt.tight_layout()
plt.savefig("popv_confidence_distribution.png", dpi=150, bbox_inches="tight")
print("Saved popv_confidence_distribution.png")
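To complement the per-cell tiers above, it can help to see which predicted cell types concentrate the low-agreement cells. The following is a minimal sketch on a toy obs table; the function name summarize_agreement_by_type and the 0.5 threshold default are illustrative choices, while the popv_prediction and popv_agreement column names follow the text.

```python
import pandas as pd


def summarize_agreement_by_type(obs: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Per predicted type: cell count, mean agreement, and fraction of cells below threshold."""
    grouped = obs.groupby("popv_prediction", observed=True)["popv_agreement"]
    summary = pd.DataFrame({
        "n_cells": grouped.size(),
        "mean_agreement": grouped.mean(),
        "frac_low": grouped.apply(lambda s: (s < threshold).mean()),
    })
    # Least confident types first, so candidates for manual review appear at the top
    return summary.sort_values("mean_agreement")


# Toy query annotations
obs = pd.DataFrame({
    "popv_prediction": ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"],
    "popv_agreement":  [1.0, 0.9, 0.8, 0.4, 0.3, 0.8],
})

summary = summarize_agreement_by_type(obs)
print(summary)
```

With real data this would be called on query_obs from the workflow above; types with a high frac_low are the first places to look for novel or transitional states.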
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| ref_labels_key | Process_Query | — | Any obs column | Column in adata_ref.obs containing training cell type labels |
| n_epochs_unsupervised | Process_Query | 50 | 20–500 | scVI training epochs; increase for better embedding on large/complex datasets |
| n_epochs_semisupervised | Process_Query | 20 | 10–100 | scANVI fine-tuning epochs on top of scVI |
| hvg | Process_Query | 4000 | 2000–8000 | Highly variable genes used for embedding and KNN methods |
| use_gpu | Process_Query | True | True, False | GPU acceleration for scVI/SCANVI; falls back to CPU automatically if no GPU |
| methods | annotate_data | all | List of method names | Subset of methods to run; excluding slow methods (scanvi_popv, onclass) speeds up pipeline |
| unknown_celltype_label | Process_Query | "unknown" | Any string | Label assigned to query cells before annotation; used to separate reference labels from query |
| popv_agreement | (output) | — | 0.0–1.0 | Fraction of methods agreeing on consensus label; >=0.8 recommended for high confidence |
Check gene overlap before running: popV performs best with >70% gene overlap between reference and query. If overlap is <50%, annotation quality degrades significantly — consider using a different reference or imputing missing genes.
shared = adata_ref.var_names.intersection(adata_query.var_names)
print(f"Gene overlap: {len(shared) / adata_ref.n_vars:.1%}")
Use raw counts as input: pass raw (un-normalized) counts in adata.X to Process_Query. popV internally applies its own normalization. Pre-normalized data can distort the scVI/SCANVI latent space.
Match reference granularity to query biology: if your query contains subtypes not in the reference, no method will correctly assign them — they will appear as low-agreement cells. Either add them to the reference or accept that the consensus will assign the nearest parent type.
Exclude slow methods when speed matters: scanvi_popv and onclass are the slowest. For a quick first-pass, run only knn_harmony, knn_bbknn, rf, xgboost, and celltypist_popv.
popv.annotation.annotate_data(adata, methods=["knn_harmony", "knn_bbknn", "rf", "xgboost", "celltypist_popv"])
Save trained models for repeated queries: Process_Query stores scVI/SCANVI models in save_path_trained_models. Reuse these when annotating additional query batches against the same reference to avoid retraining.
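For the raw-counts recommendation above, a quick heuristic check before preprocessing can catch matrices that were already normalized or log-transformed. The sketch below is an assumption-laden helper (the name looks_like_raw_counts and the 200-row sample are illustrative) that only tests whether the sampled values are non-negative integers; it cannot prove the data are raw counts.

```python
import numpy as np


def looks_like_raw_counts(X, n_rows: int = 200) -> bool:
    """Heuristic: the first n_rows contain only non-negative, integer-valued entries."""
    sub = X[:n_rows]
    if hasattr(sub, "toarray"):      # scipy sparse matrices expose toarray()
        sub = sub.toarray()
    values = np.asarray(sub).ravel()
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Toy matrices: integer counts versus log1p-transformed values
counts = np.array([[0, 3, 1], [2, 0, 5]], dtype=float)
logged = np.log1p(counts)

print("counts look raw:", looks_like_raw_counts(counts))
print("logged look raw:", looks_like_raw_counts(logged))
```

With AnnData objects the same call could be made as looks_like_raw_counts(adata_ref.X) and looks_like_raw_counts(adata_query.X) before building the combined object.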
When to use: downstream analyses (DE, trajectory) require clean labels; exclude ambiguous cells.
import scanpy as sc
# Annotate as in Workflow 1 first
query_mask = adata.obs["_dataset"] == "query"
adata_query_annotated = adata[query_mask].copy()
# Keep only high-confidence cells
high_conf = adata_query_annotated[adata_query_annotated.obs["popv_agreement"] >= 0.8].copy()
print(f"High-confidence cells: {high_conf.n_obs} / {adata_query_annotated.n_obs} "
      f"({high_conf.n_obs/adata_query_annotated.n_obs:.1%})")
print(high_conf.obs["popv_prediction"].value_counts())
# Recompute UMAP on high-confidence subset for visualization
sc.pp.neighbors(high_conf, use_rep="X_scVI") # use scVI embedding stored by popV
sc.tl.umap(high_conf)
sc.pl.umap(high_conf, color="popv_prediction", save="_high_conf_celltypes.png")
When to use: understanding where methods disagree to identify systematic biases or novel populations.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
query_mask = adata.obs["_dataset"] == "query"
query_obs = adata[query_mask].obs.copy()
# Collect per-method columns
method_cols = [
    c for c in query_obs.columns
    if c.endswith("_popv") and c not in ("popv_prediction", "popv_agreement")
]
# Cross-tabulate two key methods
ct = pd.crosstab(
    query_obs["knn_harmony_popv"],
    query_obs["scanvi_popv"],
    margins=False,
)
# Normalize rows
ct_norm = ct.div(ct.sum(axis=1), axis=0)
plt.figure(figsize=(12, 10))
sns.heatmap(ct_norm, cmap="Blues", vmin=0, vmax=1,
            xticklabels=True, yticklabels=True,
            cbar_kws={"label": "Fraction of cells"})
plt.title("knn_harmony vs scanvi label agreement")
plt.xlabel("SCANVI label")
plt.ylabel("KNN-Harmony label")
plt.tight_layout()
plt.savefig("popv_method_agreement_heatmap.png", dpi=150)
print("Saved popv_method_agreement_heatmap.png")
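The cross-tabulation above compares one pair of methods. To get an overview across all methods at once, one option is a pairwise matrix holding the fraction of cells on which each pair assigns the same label. This is a minimal sketch on toy labels; the function name pairwise_method_agreement is hypothetical and the column names are illustrative.

```python
import numpy as np
import pandas as pd


def pairwise_method_agreement(labels: pd.DataFrame) -> pd.DataFrame:
    """Symmetric matrix: fraction of cells where two methods give identical labels."""
    methods = list(labels.columns)
    mat = pd.DataFrame(np.eye(len(methods)), index=methods, columns=methods)
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            # Compare as strings so categorical columns with different categories still work
            frac = float((labels[a].astype(str) == labels[b].astype(str)).mean())
            mat.loc[a, b] = frac
            mat.loc[b, a] = frac
    return mat


# Toy per-method labels for four cells
labels = pd.DataFrame({
    "knn_harmony_popv": ["T cell", "T cell", "B cell", "NK cell"],
    "scanvi_popv":      ["T cell", "T cell", "B cell", "B cell"],
    "rf_popv":          ["T cell", "B cell", "B cell", "B cell"],
})

mat = pairwise_method_agreement(labels)
print(mat)
```

On real data, labels would be query_obs[method_cols]; the resulting matrix can be passed to sns.heatmap in the same way as ct_norm. A method that disagrees with every other method is a candidate for exclusion or closer inspection.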
When to use: quick annotation without GPU or when scVI/SCANVI training is prohibitively slow (>500k cells).
import popv
# Process without training deep generative models (scVI not needed for KNN-Harmony)
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=0,    # skip scVI training
    n_epochs_semisupervised=0,  # skip scANVI training
    use_gpu=False,
    hvg=3000,
)
# Run only fast non-DL methods
popv.annotation.annotate_data(
    adata,
    methods=["knn_harmony", "knn_bbknn", "knn_scanorama", "rf", "xgboost", "svm", "celltypist_popv"],
)
query_mask = adata.obs["_dataset"] == "query"
print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].describe())
| Problem | Cause | Solution |
|---|---|---|
| KeyError: ref_labels_key not in adata_ref.obs | Reference lacks a cell type column | Verify the column name: print(adata_ref.obs.columns.tolist()); update ref_labels_key accordingly |
| Gene space mismatch error | Reference and query have very few shared genes | Check adata_ref.var_names.intersection(adata_query.var_names); if <50% overlap, use a different reference or match gene panels |
| CUDA out-of-memory for scVI/SCANVI | GPU VRAM insufficient for batch size | Set use_gpu=False to train on CPU, or reduce memory use by lowering hvg or subsampling the reference; lowering n_epochs_unsupervised shortens training but does not reduce peak memory |
| onclass_popv failures on small datasets | ONCLASS requires sufficient label coverage | Remove "onclass" from the methods list when reference has <10 cell types or <500 cells per type |
| Very slow annotation (>2 hours) | scVI/SCANVI training on large reference | Subsample reference to 50k cells per type; exclude "scanvi_popv" and "onclass" from methods |
| All cells receive same consensus label | Reference highly imbalanced toward one type | Balance reference by subsampling the dominant type or upsampling rare types before running popV |
| popv_agreement is very low for many cells | Methods return many different labels, so the winning label has only a small share of votes | Inspect per-method columns; consider whether reference covers the query biology; add methods or retrain with a better reference |
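The troubleshooting table suggests subsampling a dominant cell type when the reference is heavily imbalanced. One way this could be done is to cap the number of cells per label before building the combined object. The sketch below works on a toy obs table; the helper name cap_cells_per_type, the cap of 10, and the fixed seed are illustrative assumptions, and upsampling rare types is not covered.

```python
import numpy as np
import pandas as pd


def cap_cells_per_type(obs: pd.DataFrame, label_key: str, max_cells: int, seed: int = 0) -> pd.Index:
    """Return obs index values keeping at most max_cells randomly chosen cells per label."""
    rng = np.random.default_rng(seed)
    keep = []
    for _, idx in obs.groupby(label_key, observed=True).groups.items():
        idx = np.asarray(idx)
        if len(idx) > max_cells:
            idx = rng.choice(idx, size=max_cells, replace=False)  # downsample large types only
        keep.extend(idx)
    return pd.Index(keep)


# Toy reference metadata: 100 T cells and 5 B cells
obs = pd.DataFrame(
    {"cell_type": ["T cell"] * 100 + ["B cell"] * 5},
    index=[f"cell_{i}" for i in range(105)],
)

kept = cap_cells_per_type(obs, "cell_type", max_cells=10)
print(obs.loc[kept, "cell_type"].value_counts())
```

With an AnnData reference, the returned index could be used as adata_ref[kept].copy() so that rare types are left untouched while the dominant type is reduced.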
knn_harmony method internally; understand it to tune popV's KNN-based methods