Processes scRNA-seq h5ad datasets with CellWhisperer: annotates cell types and states via text queries, launches cellxgene web app, or runs local Python inference and API scoring.
npx claudepluginhub epigen/cellwhisperer --plugin cellwhisperer
This skill uses the workspace's default tool permissions.
CellWhisperer is a multimodal AI model combining transcriptomics with natural language to enable intuitive interaction with scRNA-seq datasets. Published in [Nature Biotechnology](https://doi.org/10.1038/s41587-025-02857-9).
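This skill provides three capabilities:
1. Dataset processing and web app: take an h5ad file from raw counts to an interactive CellWhisperer-powered cellxgene browser.
2. API-based scoring: use the hosted API at cellwhisperer.bocklab.org to embed texts and score/annotate cells on demand, without local model installation.
3. Local Python inference: install CellWhisperer as a Python library for local model loading and inference.
# Pinning a release tag generally prevents auto-updates, for your safety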
claude plugin marketplace add epigen/cellwhisperer@v0.1.0
claude plugin install cellwhisperer@cellwhisperer
After installing, restart Claude Code or run /reload-plugins. The skill becomes available as /cellwhisperer or is invoked automatically when CellWhisperer-related tasks are detected.
CellWhisperer uses pixi for environment management.
git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
cd cellwhisperer
All commands below should be run from the CellWhisperer project root using pixi run.
Before starting, read the project README for full context:
README.md: installation, dataset format, web app launch, paper reproduction
Goal: take a user's h5ad file from raw counts to an interactive CellWhisperer-powered cellxgene browser.
Place the h5ad file at resources/<dataset_name>/read_count_table.h5ad.
Requirements (validate before proceeding):
- Raw integer counts in .X or .layers["counts"] (int32, no NaN)
- .var must have a unique index and a gene_name column with gene symbols
- ensembl_id in .var (computed if missing)
- categorical dtype for categorical .obs columns
- .obsm must be np.ndarray (not DataFrame), dtype float/int, shape (n_obs, >=2), no Inf values
Write a validation script if the user's data needs checking. Common issues:
- Normalized values in .X → check whether raw counts are available in .layers["counts"]
- Missing gene_name column → copy index to gene_name
cd src/cellxgene_preprocessing
pixi run snakemake --cores 8 --config 'datasets=["<dataset_name>"]'
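Before running the pipeline, the input requirements listed above can be checked with a short script. The following is a minimal sketch, not part of CellWhisperer: `validate_adata` and its helpers are hypothetical names, the count check only tests for non-negative whole numbers (it does not insist on int32), and the demo uses a `SimpleNamespace` stand-in so the block runs without anndata. In practice you would pass the object returned by `anndata.read_h5ad(...)`.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def _stored_values(matrix):
    """Return the values of a dense array or a SciPy-style sparse matrix."""
    if hasattr(matrix, "tocsr"):  # duck-typed sparse check, avoids importing scipy
        return np.asarray(matrix.tocsr().data)
    return np.asarray(matrix)


def _looks_like_raw_counts(matrix):
    """True if all values are non-negative whole numbers (NaN fails the test)."""
    if matrix is None:
        return False
    vals = _stored_values(matrix)
    if vals.dtype.kind not in "iuf":  # signed int, unsigned int, float
        return False
    return bool(np.all((vals >= 0) & (vals == np.round(vals))))


def validate_adata(adata):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    counts_layer = adata.layers["counts"] if "counts" in adata.layers else None
    if not (_looks_like_raw_counts(adata.X) or _looks_like_raw_counts(counts_layer)):
        problems.append('no raw integer counts in .X or .layers["counts"]')

    if not adata.var.index.is_unique:
        problems.append(".var index is not unique")
    if "gene_name" not in adata.var.columns:
        problems.append('.var has no "gene_name" column')

    for col in adata.obs.columns:
        series = adata.obs[col]
        if isinstance(series.dtype, pd.CategoricalDtype):
            continue
        if pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            problems.append(f".obs[{col!r}] holds strings/objects; convert to categorical")

    n_obs = adata.obs.shape[0]
    for key in adata.obsm.keys():
        arr = adata.obsm[key]
        if not isinstance(arr, np.ndarray):
            problems.append(f".obsm[{key!r}] is not an np.ndarray")
        elif arr.dtype.kind not in "iuf":
            problems.append(f".obsm[{key!r}] is not float/int")
        elif arr.ndim != 2 or arr.shape[0] != n_obs or arr.shape[1] < 2:
            problems.append(f".obsm[{key!r}] does not have shape (n_obs, >=2)")
        elif np.isinf(arr).any():
            problems.append(f".obsm[{key!r}] contains Inf values")

    return problems


# Stand-in for a real AnnData object (assumption: same attribute names).
demo = SimpleNamespace(
    X=np.array([[1, 0], [3, 2]], dtype=np.int32),
    layers={},
    var=pd.DataFrame({"gene_name": ["CD3E", "MS4A1"]}, index=["g1", "g2"]),
    obs=pd.DataFrame({"sample": pd.Categorical(["s1", "s2"])}, index=["c1", "c2"]),
    obsm={"X_umap": np.zeros((2, 2))},
)
print(validate_adata(demo))  # an empty list when every check passes
```

Returning a list of messages rather than raising on the first failure lets the user see and fix all issues in one pass.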
Key notes:
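- Set CUDA_VISIBLE_DEVICES to select the GPU.
- Without a GPU, increase --cores (e.g. 32).
- One pipeline step uses the OpenAI API (requires the OPENAI_API_KEY env var). Without it, the pipeline falls back to a local Mixtral model (requires a GPU with 40GB VRAM).
- Output is written to results/<dataset_name>/.
pixi run cellxgene launch -p 5005 --host 0.0.0.0 --max-category-items 500 \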
--var-names gene_name \
results/<dataset_name>/cellwhisperer_clip_v1/cellxgene.h5ad
Access at http://localhost:5005. The web app connects to the hosted CellWhisperer API at cellwhisperer.bocklab.org for AI features (search, chat).
To self-host the embedding model (4GB VRAM), add:
--cellwhisperer-clip-model results/models/jointemb/cellwhisperer_clip_v1.ckpt
Use the hosted CellWhisperer API to embed text queries and score cells without installing the full model locally. This is useful when an agent or script needs quick cell-type annotations or text-transcriptome similarity scores.
Base URL: https://cellwhisperer.bocklab.org/clip/api
import requests
response = requests.get("https://cellwhisperer.bocklab.org/clip/api/logit_scale")
logit_scale = float(response.content)
import pickle
import torch
import requests
texts = ["T cell", "B cell", "monocyte"]
response = requests.post(
"https://cellwhisperer.bocklab.org/clip/api/text_embedding",
json=texts,
)
text_embeds = torch.from_numpy(pickle.loads(response.content))
# Shape: (len(texts), embedding_dim)
Once you have text embeddings and precomputed transcriptome embeddings (from adata.obsm["transcriptome_embeds"] in a processed dataset), compute similarity:
import torch
# text_embeds: (n_texts, embedding_dim) from API
# transcriptome_embeds: (n_cells, embedding_dim) from adata.obsm["transcriptome_embeds"]
transcriptome_embeds = torch.from_numpy(adata.obsm["transcriptome_embeds"])
scores = torch.matmul(text_embeds, transcriptome_embeds.t()) * logit_scale
# Shape: (n_texts, n_cells) - higher score = stronger match
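To make the scoring arithmetic concrete, here is a toy example using NumPy only. The embeddings and the `logit_scale` value are made up for illustration; real embeddings come from the API and from adata.obsm. It assumes both sets of embeddings are L2-normalized, in which case the dot product is a cosine similarity and `logit_scale` simply stretches it.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Two made-up text embeddings and three made-up cell embeddings (dim = 3).
text_embeds = l2_normalize(np.array([[1.0, 0.0, 0.0],
                                     [0.0, 1.0, 0.0]]))
cell_embeds = l2_normalize(np.array([[2.0, 0.0, 0.0],    # aligned with text 0
                                     [0.0, 3.0, 0.0],    # aligned with text 1
                                     [1.0, 1.0, 0.0]]))  # halfway between both
logit_scale = 10.0  # illustrative value, not the model's actual scale

# Same operation as torch.matmul(text_embeds, transcriptome_embeds.t()) * logit_scale
scores = logit_scale * text_embeds @ cell_embeds.T
print(scores.shape)  # (2, 3): one row per text, one column per cell
```

The third cell scores equally against both texts, which is the kind of ambiguity an argmax-based label assignment hides.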
For quick annotation of cells that already have precomputed transcriptome embeddings:
import pickle
import requests
import torch
import numpy as np
import anndata
# Load a CellWhisperer-processed dataset
adata = anndata.read_h5ad("results/<dataset>/cellwhisperer_clip_v1/cellxgene.h5ad")
transcriptome_embeds = torch.from_numpy(adata.obsm["transcriptome_embeds"])
# Get model parameters from API
logit_scale = float(requests.get("https://cellwhisperer.bocklab.org/clip/api/logit_scale").content)
# Embed query terms
queries = ["CD8+ cytotoxic T cell", "naive B cell", "classical monocyte"]
response = requests.post("https://cellwhisperer.bocklab.org/clip/api/text_embedding", json=queries)
text_embeds = torch.from_numpy(pickle.loads(response.content))
# Compute per-cell scores
scores = (torch.matmul(text_embeds, transcriptome_embeds.t()) * logit_scale).detach()
# Assign best-matching label per cell
best_labels = [queries[i] for i in scores.argmax(dim=0)]
adata.obs["cellwhisperer_label"] = best_labels
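The argmax above always assigns a label, even when none of the queries fits a cell well. One possible refinement, sketched here under stated assumptions, is to apply a softmax across the queries for each cell and leave low-confidence cells unlabeled. The function `assign_labels`, the `min_prob` threshold, and the `"unassigned"` fallback are hypothetical choices and not part of CellWhisperer. The toy scores stand in for a real `scores` tensor.

```python
import numpy as np


def assign_labels(scores, queries, min_prob=0.5, fallback="unassigned"):
    """Label each cell with its best query, or `fallback` if confidence is low.

    scores: array of shape (n_texts, n_cells) with logit-scaled similarities.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Softmax over the query axis; subtracting the per-cell max keeps exp() stable.
    shifted = scores - scores.max(axis=0, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)

    best = probs.argmax(axis=0)                         # best query index per cell
    best_prob = probs[best, np.arange(probs.shape[1])]  # its softmax probability
    labels = [queries[i] if p >= min_prob else fallback
              for i, p in zip(best, best_prob)]
    return labels, best_prob


queries = ["T cell", "B cell", "monocyte"]
toy_scores = np.array([
    [9.0, 1.0, 2.00],   # "T cell" scores for three cells
    [1.0, 8.0, 2.10],   # "B cell" scores
    [0.5, 0.5, 2.05],   # "monocyte" scores
])
labels, confidence = assign_labels(toy_scores, queries)
print(labels)  # ['T cell', 'B cell', 'unassigned']
```

With a torch tensor, pass `scores.numpy()`. Note that the softmax is only relative to the supplied queries, so a high probability means "best among these options", not "certainly this cell type".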
When explicitly requested, install CellWhisperer as a Python library for local model loading and inference (no API dependency).
It uses pixi for dependency management. Inform the user about the implications: their project would need to run within pixi, and pixi would need to be installed (which you could take care of). The environment can also be adapted for uv (or pip), but this is untested.
# From the cellwhisperer repo root
pixi run pip install -e .
Note: this pulls in substantial dependencies (PyTorch, transformers, geneformer). A GPU with >=4GB VRAM is recommended for inference. On CPU, embedding is significantly slower. For quick scoring without local model installation, prefer Feature 2 (API-based scoring).
from cellwhisperer.utils.model_io import load_cellwhisperer_model
# Load from a checkpoint file
pl_model, tokenizer, transcriptome_processor = load_cellwhisperer_model(
"results/models/jointemb/cellwhisperer_clip_v1.ckpt",
cache=True, # enables embedding caching for repeated calls
)
logit_scale = pl_model.model.discriminator.temperature.exp()
Model weights can be downloaded from the project website.
import anndata
from cellwhisperer.utils.processing import adata_to_embeds
adata = anndata.read_h5ad("resources/<dataset>/read_count_table.h5ad")
# adata.X must contain raw integer counts
# adata.var must have gene_name column (or gene symbols as index)
transcriptome_embeds = adata_to_embeds(
adata,
pl_model.model,
transcriptome_processor,
batch_size=32,
)
# Shape: (n_cells, embedding_dim), L2-normalized
text_embeds = pl_model.model.embed_texts(
["T cell", "B cell", "monocyte"],
chunk_size=128,
)
# Shape: (n_texts, embedding_dim), L2-normalized
from cellwhisperer.utils.inference import score_transcriptomes_vs_texts
scores, group_keys = score_transcriptomes_vs_texts(
transcriptome_input=transcriptome_embeds, # or pass adata directly
text_list_or_text_embeds=text_embeds, # or pass list of strings
logit_scale=logit_scale,
model=pl_model.model, # needed if passing raw adata/strings
transcriptome_processor=transcriptome_processor, # needed if passing raw adata
average_mode=None, # None for per-cell, "embeddings" for per-group average
score_norm_method=None, # "zscore", "softmax", "01norm", or None
)
# scores shape: (n_texts, n_cells)
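For intuition about per-group scoring (`average_mode="embeddings"`), the sketch below shows one plausible reading: average the cell embeddings within each group, re-normalize, then score against the texts. This is an assumption for illustration, not the library's implementation; `score_groups` is a hypothetical helper, and whether CellWhisperer re-normalizes the averaged embeddings is not confirmed by this document.

```python
import numpy as np


def score_groups(text_embeds, cell_embeds, groups, logit_scale):
    """Score texts against group-averaged cell embeddings.

    Returns (scores, group_keys) with scores of shape (n_texts, n_groups).
    """
    groups = np.asarray(groups)
    group_keys = sorted(set(groups.tolist()))
    # Mean embedding per group, one row per group key.
    means = np.stack([cell_embeds[groups == g].mean(axis=0) for g in group_keys])
    # Assumption: re-normalize so the dot product stays a cosine similarity.
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    return logit_scale * text_embeds @ means.T, group_keys


# Made-up, L2-normalized 2-D embeddings for two texts and four cells.
toy_text_embeds = np.array([[1.0, 0.0],
                            [0.0, 1.0]])
toy_cell_embeds = np.array([[1.0, 0.0],
                            [0.6, 0.8],
                            [0.0, 1.0],
                            [0.0, 1.0]])
toy_groups = ["a", "a", "b", "b"]  # e.g. cluster labels from adata.obs

group_scores, keys = score_groups(toy_text_embeds, toy_cell_embeds, toy_groups, 1.0)
print(keys, group_scores.shape)  # ['a', 'b'] (2, 2)
```

Scoring group averages instead of single cells trades resolution for robustness, which can help when individual cells are noisy.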
from cellwhisperer.utils.processing import ensure_raw_counts_adata
ensure_raw_counts_adata(adata)
# Raises ValueError if neither .X nor .layers["counts"] has integer counts
# If .layers["counts"] has raw counts, it swaps them into .X
- GCC_7.0.0 not found: add import pyarrow as the first import in your script.
- Out of memory: reduce batch_size in adata_to_embeds or score_transcriptomes_vs_texts.
- Missing gene_name column: copy .var.index to .var["gene_name"].
- Slow processing: increase --cores in the snakemake command and expect ~2h per 10k cells on CPU. If a GPU is available, check that it is used as intended; if it is not, suggest that the user run some environment tests to find out why.