Computes ESM2 embeddings and PLL scores for protein sequences. Use for plausibility filtering, clustering, variant prediction, and sequence-function analysis.
```bash
npx claudepluginhub adaptyvbio/protein-design-skills --plugin adaptyv
```

This skill uses the workspace's default tool permissions.
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.8+ | 3.10 |
| PyTorch | 1.10+ | 2.0+ |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8GB | 24GB (A10G) |
| RAM | 16GB | 32GB |
First time? See Installation Guide to set up Modal and biomodals.
```bash
cd biomodals
modal run modal_esm2_predict_masked.py \
    --input-faa sequences.fasta \
    --out-dir embeddings/
```
GPU: A10G (24GB) | Timeout: 300s default
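The `--input-faa` argument takes a standard FASTA file. A minimal `sequences.fasta` might look like this (the sequences here are illustrative placeholders, not real designs):

```
>design_0
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>design_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA
```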
```python
import torch
import esm  # fair-esm package

# Load the 650M-parameter ESM2 model and its alphabet/tokenizer
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model = model.eval().cuda()

# Tokenize sequences (example sequence shown truncated)
data = [("seq1", "MKTAYIAKQRQISFVK...")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Forward pass, requesting the final-layer (33) representations
with torch.no_grad():
    results = model(batch_tokens.cuda(), repr_layers=[33])

# Per-token embeddings, shape (batch, tokens, 1280)
embeddings = results["representations"][33]
```
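To get one fixed-length vector per sequence (e.g. for clustering), the usual pattern from the fair-esm README is to average over residue positions, skipping the BOS token at position 0 and the EOS token:

```python
# Mean-pool per-sequence embeddings, excluding BOS/EOS special tokens
sequence_representations = []
for i, (_, seq) in enumerate(data):
    # tokens are laid out as [BOS] + residues + [EOS]
    sequence_representations.append(embeddings[i, 1 : len(seq) + 1].mean(0))
```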
| Model | Parameters | Speed | Quality |
|---|---|---|---|
| esm2_t6_8M | 8M | Fastest | Basic (fast screening) |
| esm2_t12_35M | 35M | Fast | Good |
| esm2_t33_650M | 650M | Medium | Better |
| esm2_t36_3B | 3B | Slow | Best |
```
embeddings/
├── embeddings.npy   # (N, 1280) array
├── pll_scores.csv   # PLL for each sequence
└── metadata.json    # Sequence info
```
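Downstream use of these files is straightforward with numpy/pandas; a minimal sketch (file names from the tree above; the choice of 5 clusters is an arbitrary assumption):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Load the outputs written by the Modal script
embeddings = np.load("embeddings/embeddings.npy")   # (N, 1280)
scores = pd.read_csv("embeddings/pll_scores.csv")   # sequence_id, pll, pll_normalized, length
assert embeddings.shape[0] == len(scores)

# Cluster mean-pooled embeddings for diversity analysis
scores["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print(scores.groupby("cluster")["pll_normalized"].mean())
```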
What good output looks like:

```
$ modal run modal_esm2_predict_masked.py --input-faa designs.fasta
[INFO] Loading ESM2-650M model...
[INFO] Processing 100 sequences...
[INFO] Computing pseudo-log-likelihood...
```

embeddings/pll_scores.csv:

```
sequence_id,pll,pll_normalized,length
design_0,-0.82,0.15,78
design_1,-0.95,0.08,85
design_2,-1.23,-0.12,72
...
```

Summary:
- Mean PLL: -0.91
- Sequences with normalized PLL > 0: 42/100 (42%)
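The exact PLL computation inside `modal_esm2_predict_masked.py` isn't reproduced here; a common masked-marginal formulation with fair-esm looks like the sketch below (an illustrative helper, not the skill's code; one forward pass per masked position, so it is slow, and real pipelines batch the masked copies):

```python
import torch
import torch.nn.functional as F

def pseudo_log_likelihood(model, alphabet, sequence):
    """Average per-residue log-likelihood via masked marginals (illustrative helper)."""
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("seq", sequence)])
    tokens = tokens.cuda()
    total = 0.0
    for i in range(1, len(sequence) + 1):   # skip BOS (0) and EOS (L+1)
        masked = tokens.clone()
        masked[0, i] = alphabet.mask_idx    # mask one position at a time
        with torch.no_grad():
            logits = model(masked)["logits"]
        log_probs = F.log_softmax(logits[0, i], dim=-1)
        total += log_probs[tokens[0, i]].item()  # log-prob of the true residue
    return total / len(sequence)
```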
```
Should I use ESM2?
│
├─ What do you need?
│   ├─ Sequence plausibility score → ESM2 PLL ✓
│   ├─ Embeddings for clustering → ESM2 ✓
│   ├─ Variant effect prediction → ESM2 ✓
│   └─ Structure prediction → Use ESMFold
│
├─ What model size?
│   ├─ Fast screening → esm2_t12_35M
│   ├─ Standard use → esm2_t33_650M ✓
│   └─ Best quality → esm2_t36_3B
│
└─ Use case?
    ├─ QC filtering → normalized PLL > 0.0 threshold
    ├─ Diversity analysis → Mean-pooled embeddings
    └─ Mutation scanning → Per-position log-odds (sketched below)
```
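For mutation scanning, the per-position log-odds is the difference in masked-marginal log-probability between the mutant and wild-type residue. A hypothetical helper (`mutation_log_odds` is not part of the skill; `pos` is 0-based into the sequence):

```python
import torch
import torch.nn.functional as F

def mutation_log_odds(model, alphabet, sequence, pos, mut_aa):
    """log p(mutant) - log p(wild type) at one position, masked-marginal style."""
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("seq", sequence)])
    tokens = tokens.cuda()
    i = pos + 1                             # shift past the BOS token
    masked = tokens.clone()
    masked[0, i] = alphabet.mask_idx
    with torch.no_grad():
        logits = model(masked)["logits"]
    log_probs = F.log_softmax(logits[0, i], dim=-1)
    return (log_probs[alphabet.get_idx(mut_aa)] -
            log_probs[alphabet.get_idx(sequence[pos])]).item()
```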
| Normalized PLL | Interpretation |
|---|---|
| > 0.2 | Very natural sequence |
| 0.0 to 0.2 | Good, natural-like |
| -0.5 to 0.0 | Acceptable |
| < -0.5 | May be unnatural |
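Applied to pll_scores.csv, these bands translate into a simple pandas filter (a sketch; the 0.0 cutoff is the QC threshold from the decision tree above):

```python
import pandas as pd

scores = pd.read_csv("embeddings/pll_scores.csv")
passed = scores[scores["pll_normalized"] > 0.0]                   # good, natural-like
borderline = scores[scores["pll_normalized"].between(-0.5, 0.0)]  # acceptable, review
print(f"pass: {len(passed)}  borderline: {len(borderline)}  total: {len(scores)}")
```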
| Campaign Size | Time (A10G) | Cost (Modal) | Notes |
|---|---|---|---|
| 100 sequences | 5-10 min | ~$1 | Quick screen |
| 1000 sequences | 30-60 min | ~$5 | Standard |
| 5000 sequences | 2-3h | ~$20 | Large batch |
Throughput: ~100-200 sequences/minute with 650M model.
```bash
wc -l embeddings/pll_scores.csv  # should equal input sequence count + 1 (header line)
```
- OOM errors: use a smaller model or batch sequences (see the batching sketch below)
- Slow processing: use esm2_t12_35M for speed
- Low PLL scores: may indicate unusual/designed sequences
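For local runs, one way to reduce OOM risk is to sort sequences by length and process small chunks, so per-batch padding stays minimal. A sketch reusing `model`, `batch_converter`, and `data` from the Python example above (the chunk size of 8 is an arbitrary assumption, not a recommendation):

```python
# Sort by sequence length so each chunk pads minimally, then run small chunks
chunk_size = 8
data_sorted = sorted(data, key=lambda item: len(item[1]))
all_reps = []
for start in range(0, len(data_sorted), chunk_size):
    chunk = data_sorted[start : start + chunk_size]
    _, _, toks = batch_converter(chunk)
    with torch.no_grad():
        out = model(toks.cuda(), repr_layers=[33])
    all_reps.append(out["representations"][33].cpu())  # pool/store per chunk
```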
| Error | Cause | Fix |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Sequence too long or large batch | Reduce batch size |
| `KeyError: representation` | Wrong layer requested | Use layer 33 for the 650M model |
| `ValueError: sequence` | Invalid amino acid | Check for non-standard AAs |
Next: Structure prediction with chai or boltz → protein-qc for filtering.