CellWhisperer
CellWhisperer is a multimodal AI model combining transcriptomics with natural language to enable intuitive interaction with scRNA-seq datasets. CellWhisperer is published in Nature Biotechnology. The project website hosts the web tool with several example datasets as well as a short video tutorial. We also provide our model weights and curated datasets.
This repository contains detailed instructions on how to run your own CellWhisperer instance and import custom datasets, as well as the full source code, models, and training data.
Installation
Installing a local copy of CellWhisperer allows you to analyze your own datasets and explore scRNA-seq data interactively using the CellWhisperer AI model. The installation process takes approximately 15 minutes and supports both CPU and GPU (CUDA 12) environments.
Option A: Pixi (recommended for macOS and Linux)
Pixi, much like uv, provides a fast, reproducible setup with a single command.
- Clone the repository with all submodules:
git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
cd cellwhisperer
- Install:
bash envs/setup_pixi.sh
All dependencies (including snakemake and cellxgene) are resolved automatically from pixi.toml. Use pixi run or pixi shell to execute commands in the environment.
Option B: Conda (Linux-only)
- Clone the repository with all submodules (required):
git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
cd cellwhisperer
If you've already cloned without submodules, retrieve them with:
git submodule update --init --recursive
- Set up the conda environments:
./envs/setup.sh
This script creates the necessary conda environments including cellwhisperer (main environment) and llava (for the chat model).
- Install snakemake (optional, for running paper analyses):
conda install -c bioconda -n base snakemake=7
Alternatively, snakemake is accessible within the cellwhisperer environment after activation.
- Verify installation:
Activate the environment and check that cellxgene is available:
conda activate cellwhisperer
cellxgene --version
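The same check can be scripted, for example before launching a longer pipeline. The sketch below uses only the Python standard library to report which of the expected command-line tools are on the PATH. The helper name `find_tools` and the choice of tools to look for are assumptions made for illustration; they are not part of CellWhisperer.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # cellxgene and snakemake are the tools named in the installation steps
    for tool, path in find_tools(["cellxgene", "snakemake"]).items():
        print(f"{tool}: {path or 'NOT FOUND (is the environment activated?)'}")
```

Run it from inside the activated environment; a missing tool usually means the environment was not activated or the setup script did not finish.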
Note on compilers: If you encounter build issues, you may need to install gcc and g++ (version 9.5 recommended). If you install the compilers via conda, be aware that they may cause compatibility issues with snakemake.
You're now ready to run CellWhisperer locally (see next section) or analyze your own datasets.
Option C: Docker (best for deployment; Linux-only)
For users who prefer containerized environments, CellWhisperer can be installed and run using Docker. This approach includes all dependencies and installation steps in a self-contained environment.
- Build the Docker image:
docker build -t cellwhisperer .
- Run the container:
docker run --gpus all -it --volume .:/opt/cellwhisperer cellwhisperer bash
# Also works without GPUs (omit --gpus all)
- Activate the environment inside the container:
conda activate cellwhisperer
Note on volumes: The command above mounts the project directory as a volume (--volume .:/opt/cellwhisperer) so that code modifications are visible inside the container. For processing datasets, consider also mounting resources and results directories:
docker run --gpus all -it \
--volume .:/opt/cellwhisperer \
--volume /path/to/resources:/opt/cellwhisperer/resources \
--volume /path/to/results:/opt/cellwhisperer/results \
cellwhisperer bash
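If you launch the container from a script, the command above can be assembled programmatically. The following is a minimal sketch, not an official launcher: the function name `docker_run_args` and its parameters are hypothetical, while the image name, mount points, and flags are taken from the commands above.

```python
from pathlib import Path


def docker_run_args(project_dir, resources=None, results=None, use_gpu=True):
    """Build the argument list for `docker run`, mirroring the command above."""
    mount_root = "/opt/cellwhisperer"
    args = ["docker", "run"]
    if use_gpu:
        # Omit this flag on machines without GPUs
        args += ["--gpus", "all"]
    args.append("-it")
    # Absolute host paths avoid relying on relative-path handling in the Docker CLI
    args += ["--volume", f"{Path(project_dir).resolve()}:{mount_root}"]
    if resources is not None:
        args += ["--volume", f"{Path(resources).resolve()}:{mount_root}/resources"]
    if results is not None:
        args += ["--volume", f"{Path(results).resolve()}:{mount_root}/results"]
    args += ["cellwhisperer", "bash"]
    return args


if __name__ == "__main__":
    import shlex

    # Print the command for inspection instead of executing it
    print(shlex.join(docker_run_args(".", "/path/to/resources", "/path/to/results")))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting issues.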
Analyze Your Own Datasets
CellWhisperer can analyze your own scRNA-seq datasets through a straightforward three-step process. We currently support human data with raw (unnormalized) read counts.
Processing time: Approximately 2 hours per 10,000 cells on CPU (significantly faster with GPU).
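To get a rough idea of how long a dataset will take on CPU, the figure above can be scaled by cell count. This sketch assumes processing time grows linearly with the number of cells, which the figure suggests but does not guarantee; the function name `estimate_cpu_hours` is hypothetical.

```python
def estimate_cpu_hours(n_cells, hours_per_10k=2.0):
    """Rough CPU processing time, assuming linear scaling with cell count."""
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    return n_cells / 10_000 * hours_per_10k


if __name__ == "__main__":
    for n in (5_000, 10_000, 50_000):
        print(f"{n:>7,} cells: ~{estimate_cpu_hours(n):.1f} h on CPU")
```

Treat the result as an order-of-magnitude estimate; actual runtime depends on hardware and dataset characteristics.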
Step 1: Prepare Your Dataset
Save your dataset as an h5ad file at <PROJECT_ROOT>/resources/<dataset_name>/read_count_table.h5ad. It must meet the following requirements:
Required:
- Raw read counts (int32 format) in .X or .layers["counts"]
- .var must have a unique index (e.g., Ensembl IDs) and a gene_name field with gene symbols
- No NaN values in the count matrix
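The requirements listed above can be checked before starting a long processing run. The sketch below is one possible pre-flight check, not CellWhisperer's own validation. It works on any AnnData-like object and makes two assumptions: that `.layers["counts"]` takes precedence over `.X` when both exist, and that the gene_name field is a column of `.var`. The function name `check_requirements` is hypothetical.

```python
import numpy as np


def check_requirements(adata):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Assumption: prefer .layers["counts"] when present, otherwise use .X
    matrix = adata.layers["counts"] if "counts" in adata.layers else adata.X
    # Sparse matrices (CSR/CSC) keep stored values in .data; dense arrays are used as-is
    values = np.asarray(matrix.data if hasattr(matrix, "nnz") else matrix)

    if values.dtype != np.int32:
        problems.append(f"counts have dtype {values.dtype}, expected int32")
    # NaN can only occur in floating-point data
    if np.issubdtype(values.dtype, np.floating) and np.isnan(values).any():
        problems.append("count matrix contains NaN values")
    if not adata.var.index.is_unique:
        problems.append(".var index is not unique")
    if "gene_name" not in adata.var.columns:
        problems.append(".var has no gene_name column")
    return problems
```

With the anndata package installed, this could be called as `check_requirements(anndata.read_h5ad(path))`, printing each returned problem before deciding whether to proceed.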