CellWhisperer

CellWhisperer is a multimodal AI model combining transcriptomics with natural language to enable intuitive interaction with scRNA-seq datasets. CellWhisperer is published in Nature Biotechnology. The project website hosts the web tool with several example datasets as well as a short video tutorial. We also provide our model weights and curated datasets.

This repository contains detailed instructions on how to run your own CellWhisperer instance and import custom datasets, as well as the full source code, models, and training data.

Installation
Analyze Your Own Datasets
Folder Structure
Reproducing Paper Analyses
Citation and Contact

Installation

Installing a local copy of CellWhisperer allows you to analyze your own datasets and explore scRNA-seq data interactively using the CellWhisperer AI model. The installation process takes approximately 15 minutes and supports both CPU and GPU (CUDA 12) environments.

Option A: Pixi (recommended for Mac & Linux)

Pixi, much like uv, provides a fast, reproducible setup with a single command.

Clone the repository with all submodules:

git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
cd cellwhisperer

Install:
```
bash envs/setup_pixi.sh
```

All dependencies (including snakemake and cellxgene) are resolved automatically from pixi.toml. Use pixi run or pixi shell to execute commands in the environment.

Option B: Conda (Linux-only)

Clone the repository with all submodules (required):

git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
cd cellwhisperer

If you've already cloned without submodules, retrieve them with:

git submodule update --init --recursive

Set up the conda environments:
```
./envs/setup.sh
```
This script creates the necessary conda environments including cellwhisperer (main environment) and llava (for the chat model).
Install snakemake (optional, for running paper analyses):
```
conda install -c bioconda -n base snakemake=7
```
Alternatively, snakemake is accessible within the cellwhisperer environment after activation.
Verify installation: Activate the environment and check that cellxgene is available:
```
conda activate cellwhisperer
cellxgene --version
```

Note on compilers: If you encounter build issues, you may need to install gcc and g++ (version 9.5 recommended). If you install the compilers via conda, be aware of potential compatibility issues with snakemake.

You're now ready to run CellWhisperer locally (see next section) or analyze your own datasets.

Option C: Docker (Best for deployment; Linux-only)

For users who prefer containerized environments, CellWhisperer can be installed and run using Docker. This approach includes all dependencies and installation steps in a self-contained environment.

Build the Docker image:
```
docker build -t cellwhisperer .
```

Run the container:

docker run --gpus all -it --volume .:/opt/cellwhisperer cellwhisperer bash
# Also works without GPUs (omit --gpus all)

Activate the environment inside the container:
```
conda activate cellwhisperer
```

Note on volumes: The command above mounts the project directory as a volume (--volume .:/opt/cellwhisperer) so that code modifications are visible inside the container. For processing datasets, consider also mounting resources and results directories:

docker run --gpus all -it \
  --volume .:/opt/cellwhisperer \
  --volume /path/to/resources:/opt/cellwhisperer/resources \
  --volume /path/to/results:/opt/cellwhisperer/results \
  cellwhisperer bash

Analyze Your Own Datasets

CellWhisperer can analyze your own scRNA-seq datasets through a straightforward three-step process. We currently support human data with raw (unnormalized) read counts.

Processing time: Approximately 2 hours per 10,000 cells on CPU (significantly faster with GPU).
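
To budget compute before starting, the figure above can be turned into a rough estimate. This is a minimal sketch that assumes processing time scales linearly with the number of cells, which the README does not state; `estimate_cpu_hours` is a hypothetical helper, not part of CellWhisperer.

```python
def estimate_cpu_hours(n_cells: int, hours_per_10k: float = 2.0) -> float:
    """Rough CPU processing time, assuming linear scaling with cell count."""
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    return n_cells / 10_000 * hours_per_10k


if __name__ == "__main__":
    # Print estimates for a few example dataset sizes.
    for n in (5_000, 10_000, 50_000):
        print(f"{n:>7,} cells: ~{estimate_cpu_hours(n):.1f} h on CPU")
```

Treat the result as an order-of-magnitude guide; actual runtime depends on your hardware.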

Step 1: Prepare Your Dataset

Place your dataset as an h5ad file at <PROJECT_ROOT>/resources/<dataset_name>/read_count_table.h5ad with the following requirements:

Required:

Raw read counts (int32 format) in .X or .layers["counts"]
.var must have a unique index (e.g., Ensembl IDs) and a gene_name field with gene symbols
No NaN values in the count matrix
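
The requirements above can be checked before starting a long processing run. The following is a minimal sketch, not part of CellWhisperer: the function names (`dataset_path`, `check_counts_and_var`, `validate_dataset`) are hypothetical, and preferring `.layers["counts"]` over `.X` when both exist is an assumption. It needs numpy and pandas, plus anndata for loading the file.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def dataset_path(project_root, dataset_name):
    """Location where the dataset is expected to be placed."""
    return Path(project_root) / "resources" / dataset_name / "read_count_table.h5ad"


def _stored_values(X):
    """Return the entries of X as a NumPy array (stored entries only if X is sparse)."""
    if hasattr(X, "tocsr"):  # scipy sparse matrices provide tocsr(); dense arrays do not
        return np.asarray(X.tocsr().data)
    return np.asarray(X)


def check_counts_and_var(X, var):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    values = _stored_values(X)
    if values.dtype != np.int32:
        problems.append(f"counts have dtype {values.dtype}, expected int32")
    if np.issubdtype(values.dtype, np.floating):
        nan_mask = np.isnan(values)
        if nan_mask.any():
            problems.append("count matrix contains NaN values")
        finite = values[~nan_mask]
        if not np.array_equal(finite, np.round(finite)):
            problems.append("counts are not whole numbers (normalized data?)")
    if not var.index.is_unique:
        problems.append(".var index is not unique")
    if "gene_name" not in var.columns:
        problems.append(".var has no gene_name column")
    return problems


def validate_dataset(project_root, dataset_name):
    """Load the h5ad file and run the checks (requires the anndata package)."""
    import anndata as ad  # imported lazily so the checks above work without it

    adata = ad.read_h5ad(dataset_path(project_root, dataset_name))
    # Assumption: use .layers["counts"] when present, otherwise fall back to .X.
    X = adata.layers["counts"] if "counts" in adata.layers else adata.X
    return check_counts_and_var(X, adata.var)
```

For example, `validate_dataset(".", "my_dataset")` (with `my_dataset` as a placeholder name) returns an empty list when the file meets the requirements listed above, and a list of messages otherwise.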
