Generates novel protein sequences, predicts 3D structures from sequences, performs inverse folding, and extracts embeddings using ESM3/ESM C models. Supports local GPU or EvolutionaryScale Forge API.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

This skill uses the workspace's default tool permissions.
ESM (Evolutionary Scale Modeling) provides pretrained protein language models for generative protein design and representation learning. ESM3 is a multimodal generative model conditioned on sequence, structure, and function simultaneously. ESM C is an efficient embedding model optimized for extracting protein representations for downstream ML tasks.
Requires the esm package (EvolutionaryScale):

pip install esm
# For Forge cloud API
pip install esm[forge]
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

# Load ESM C model for embeddings
model = ESMC.from_pretrained("esmc_600m")

# Create protein from sequence
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAATGFHIIPGDKPDNRAGGYDN")

# Encode, then request per-residue embeddings alongside sequence logits
protein_tensor = model.encode(protein)
output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))

embeddings = output.embeddings  # shape: (1, seq_len + 2, 1152) — includes BOS/EOS tokens
print(f"Embedding shape: {embeddings.shape}")
# e.g. (1, 105, 1152) for this 103-residue sequence
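The returned tensor includes special BOS/EOS token positions at either end (worth verifying against your installed version), so pooled representations should drop them before averaging. A pure-NumPy sketch of the pooling step, with a random array standing in for real model output:

```python
import numpy as np

# Stand-in for model output: batch of 1, 10 residues plus BOS/EOS, 1152-dim embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1, 12, 1152))

# Drop the first (BOS) and last (EOS) positions, then mean-pool over residues
per_residue = embeddings[:, 1:-1, :]   # (1, 10, 1152)
pooled = per_residue.mean(axis=1)      # (1, 1152) fixed-length vector
print(per_residue.shape, pooled.shape)
```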
Generate novel protein sequences conditioned on structure, function, or partial sequence.
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
# Load ESM3 locally (weights download on first use; add .to("cuda") to use a GPU)
model = ESM3.from_pretrained("esm3_sm_open_v1")
# Generate from partial sequence (fill in masked positions)
prompt = ESMProtein(sequence="MKTAYIAK____ISFVK____RQLEERLG") # ____ = positions to generate
config = GenerationConfig(track="sequence", num_steps=10, temperature=0.7)
generated = model.generate(prompt, config)
print(f"Generated sequence: {generated.sequence[:50]}...")
# Conditional generation: design sequence for a target structure
from esm.sdk.api import ESMProtein, GenerationConfig
from esm.utils.structure.protein_chain import ProteinChain
# Load target structure from PDB
chain = ProteinChain.from_pdb("target.pdb")
prompt = ESMProtein.from_protein_chain(chain)
prompt.sequence = None # Clear sequence, keep structure
config = GenerationConfig(track="sequence", num_steps=16, temperature=0.5)
designed = model.generate(prompt, config)
print(f"Designed sequence ({len(designed.sequence)} residues): {designed.sequence[:50]}...")
Extract fixed-length representations for downstream ML tasks.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import torch

model = ESMC.from_pretrained("esmc_600m")  # or "esmc_300m" for a lighter model
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAATGFHIIPGDKPDNRAGGYDN",
    "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKS",
]
embeddings = []
for seq in sequences:
    protein = ESMProtein(sequence=seq)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    # Mean-pool per-residue embeddings (dropping BOS/EOS) to get a fixed-length vector
    mean_emb = output.embeddings[:, 1:-1, :].mean(dim=1)  # shape: (1, embedding_dim)
    embeddings.append(mean_emb)
emb_matrix = torch.cat(embeddings, dim=0)
print(f"Embedding matrix: {emb_matrix.shape}")  # (2, 1152)
# Compute pairwise similarity
similarity = torch.cosine_similarity(emb_matrix[0:1], emb_matrix[1:2])
print(f"Cosine similarity: {similarity.item():.4f}")
Predict 3D coordinates from amino acid sequence.
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
model = ESM3.from_pretrained("esm3_sm_open_v1")
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAATGFHIIPGDKPDNRAGGYDN")
# Generate structure from sequence
config = GenerationConfig(track="structure", num_steps=16)
result = model.generate(protein, config)
# Save predicted structure
result.to_pdb("predicted.pdb")
print(f"Saved structure: {len(result.sequence)} residues → predicted.pdb")
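The saved PDB file can be sanity-checked without any structure library. A minimal sketch counting CA (alpha-carbon) ATOM records — one per residue in a standard PDB — run here on an inline two-residue example:

```python
def count_ca_atoms(pdb_text: str) -> int:
    """Count alpha-carbon ATOM records — one per residue in a standard PDB."""
    count = 0
    for line in pdb_text.splitlines():
        # Columns 13-16 (0-indexed 12:16) hold the atom name in the fixed-width PDB format
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            count += 1
    return count

# Inline two-residue example with fixed-width columns
example = (
    "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00  0.00           N\n"
    "ATOM      2  CA  MET A   1      11.639   6.071  -5.147  1.00  0.00           C\n"
    "ATOM      3  CA  LYS A   2      10.440   5.705  -4.285  1.00  0.00           C\n"
)
print(count_ca_atoms(example))  # → 2
```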
Design amino acid sequences that fold into a target 3D structure.
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from esm.utils.structure.protein_chain import ProteinChain
model = ESM3.from_pretrained("esm3_sm_open_v1")
# Load target structure
chain = ProteinChain.from_pdb("target_structure.pdb")
prompt = ESMProtein.from_protein_chain(chain)
# Clear sequence but keep structure coordinates
prompt.sequence = None
# Generate multiple designs
designs = []
config = GenerationConfig(track="sequence", num_steps=16, temperature=0.7)
for i in range(5):
    designed = model.generate(prompt, config)
    designs.append(designed.sequence)
    print(f"Design {i+1}: {designed.sequence[:40]}...")
print(f"Generated {len(designs)} sequence designs for target structure")
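When sampling several designs for the same backbone, it is worth checking that they are actually diverse. A minimal sketch computing pairwise identity between equal-length candidates (the short sequences here are illustrative, not model output):

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    assert len(a) == len(b), "inverse-folding designs share the target's length"
    return sum(x == y for x, y in zip(a, b)) / len(a)

designs = ["MKTAYIAK", "MKTVYIAK", "QSTVWLPR"]
for i in range(len(designs)):
    for j in range(i + 1, len(designs)):
        ident = pairwise_identity(designs[i], designs[j])
        print(f"design {i+1} vs {j+1}: {ident:.2f}")
```

High pairwise identity (say, above 0.9) suggests raising the sampling temperature.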
Generate proteins with desired functional annotations (GO terms, enzyme activity).
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig, FunctionAnnotation

model = ESM3.from_pretrained("esm3_sm_open_v1")

# Condition on functional keywords. Annotations are FunctionAnnotation(label, start, end)
# spans, and an all-mask sequence fixes the target length for de novo generation.
length = 120
protein = ESMProtein(
    sequence="_" * length,  # every position masked: generate de novo
    function_annotations=[
        FunctionAnnotation(label="ATP binding", start=1, end=length),
        FunctionAnnotation(label="kinase activity", start=1, end=length),
        FunctionAnnotation(label="protein phosphorylation", start=1, end=length),
    ],
)
config = GenerationConfig(track="sequence", num_steps=32, temperature=0.7)
result = model.generate(protein, config)
print(f"Function-conditioned sequence: {result.sequence[:50]}...")
print(f"Length: {len(result.sequence)} residues")
Use EvolutionaryScale's cloud inference for large models without local GPU.
import os
from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig

# Authenticate (requires FORGE_API_TOKEN env var; check forge.evolutionaryscale.ai
# for currently available model names)
client = ESM3ForgeInferenceClient(
    model="esm3-open-2024-03",
    url="https://forge.evolutionaryscale.ai",
    token=os.environ["FORGE_API_TOKEN"],
)
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLG")
config = GenerationConfig(track="structure", num_steps=16)
result = client.generate(protein, config)
result.to_pdb("forge_predicted.pdb")
print("Predicted structure via Forge API → forge_predicted.pdb")
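Cloud calls can fail transiently (rate limits, timeouts). A hedged retry sketch — generate_with_retry and its exponential backoff policy are illustrative helpers, not part of the esm SDK — demonstrated with a stub client so it runs without network access:

```python
import time

def generate_with_retry(client, protein, config, max_attempts=4, base_delay=1.0):
    """Call client.generate, retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return client.generate(protein, config)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Stub client that fails twice before succeeding, to exercise the wrapper
class FlakyClient:
    def __init__(self):
        self.calls = 0
    def generate(self, protein, config):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("transient error")
        return "ok"

client = FlakyClient()
print(generate_with_retry(client, None, None, base_delay=0.01))  # → ok
```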
| Feature | ESM3 | ESM C |
|---|---|---|
| Primary use | Generative protein design | Embedding extraction |
| Capabilities | Sequence generation, structure prediction, inverse folding, function conditioning | Per-residue and mean-pooled embeddings |
| Model sizes | esm3_sm_open_v1 (~1.4B params) | esmc_300m, esmc_600m |
| GPU requirement | 8GB+ VRAM | 4GB+ VRAM (esmc_300m: 2GB) |
| Use case | Design new proteins, predict structures | Downstream ML (classification, clustering, regression) |
| Cloud option | Forge API (larger models available) | Local only |
The GenerationConfig controls how ESM3 generates outputs:
- track: which modality to generate ("sequence", "structure", or "function")
- num_steps: number of iterative refinement steps (higher = better quality, slower)
- temperature: sampling temperature (0.0 = greedy, 0.5–0.7 = diverse, 1.0 = maximum diversity)

ESMProtein is the central data container holding sequence, structure coordinates, and functional annotations:

- .sequence — amino acid string (e.g., "MKTAY...")
- .coordinates — 3D atom positions (N×3 tensor)
- .function_annotations — list of functional annotations
- ESMProtein.from_protein_chain() — load from a PDB structure
- .to_pdb() — save a predicted structure

Goal: Extract embeddings from protein sequences and train a downstream classifier.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import numpy as np

model = ESMC.from_pretrained("esmc_600m")

# Embed a set of sequences
sequences = ["MKTAY...", "MKWVT...", "MSGLI..."]  # replace with actual sequences
labels = [0, 1, 0]  # binary labels

embeddings = []
for seq in sequences:
    protein = ESMProtein(sequence=seq)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    mean_emb = output.embeddings[:, 1:-1, :].mean(dim=1).detach().cpu().numpy()
    embeddings.append(mean_emb.squeeze())
X = np.array(embeddings)
y = np.array(labels)
print(f"Feature matrix: {X.shape}") # (n_samples, 1152)
# Train a simple classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.2f}")
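Training accuracy on a handful of samples says little; with more data, hold sequences out for evaluation. A minimal k-fold index split in pure NumPy (scikit-learn's KFold does the same thing more robustly):

```python
import numpy as np

def kfold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; every sample lands in the test set exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, 5))
print(len(splits))  # → 5
```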
Goal: Design multiple novel sequences that fold into a target structure, then rank by predicted quality.
Key steps:

- Load the target structure with ProteinChain.from_pdb() (Core API module 4)
- Sample with temperature=0.7 for diversity (Core API module 1)

| Parameter | Module/Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| num_steps | GenerationConfig | varies | 1–64 | Iterative refinement steps; more = higher quality, slower |
| temperature | GenerationConfig | 1.0 | 0.0–1.5 | Sampling diversity; 0.0 = greedy, 0.7 = balanced, 1.0+ = creative |
| track | GenerationConfig | — | "sequence", "structure", "function" | Which modality to generate |
| model name | from_pretrained | — | "esm3_sm_open_v1", "esmc_300m", "esmc_600m" | Model size/capability tradeoff |
| token | ESM3ForgeInferenceClient | env var | API token string | Forge cloud authentication |
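The design-and-rank workflow above needs a scoring step. A sketch that ranks candidate sequences by a score function — score_design here is a hypothetical stand-in (in practice it might be a round-trip structure-prediction metric), not an esm API:

```python
def rank_designs(designs, score_fn, top_k=3):
    """Sort candidate sequences by descending score and keep the best top_k."""
    return sorted(designs, key=score_fn, reverse=True)[:top_k]

# Hypothetical stand-in score: penalize runs of identical residues
def score_design(seq: str) -> float:
    repeats = sum(1 for a, b in zip(seq, seq[1:]) if a == b)
    return -repeats  # fewer consecutive repeats = higher score

candidates = ["MKTAAAYI", "MKTVYIAK", "MKKKKTAY"]
print(rank_designs(candidates, score_design, top_k=2))  # → ['MKTVYIAK', 'MKTAAAYI']
```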
Use ESM C for embedding tasks, ESM3 for generation: ESM C is smaller, faster, and optimized for representation quality. Only use ESM3 when you need generative capabilities (sequence design, structure prediction, inverse folding).
Mean-pool per-residue embeddings for fixed-length representations: ESM C outputs per-residue embeddings (seq_len × dim). For downstream ML that requires fixed-length input, average across the sequence dimension, dropping the special BOS/EOS token positions first: embeddings[:, 1:-1, :].mean(dim=1).
Use temperature 0.5–0.7 for protein design: Temperature 1.0 produces very diverse but potentially non-functional sequences. Temperature 0.5–0.7 balances diversity with quality. Use temperature 0.0 only for deterministic structure prediction.
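The effect of temperature can be seen directly on a toy logit vector: dividing logits by a lower temperature sharpens the distribution toward the argmax. A pure-NumPy sketch, not tied to the esm package:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; temperature -> 0 approaches greedy argmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.7, 1.5):
    p = softmax_with_temperature(logits, t)
    print(f"T={t}: {np.round(p, 3)}")
```

At T=0.1 nearly all probability mass sits on the top logit; at T=1.5 the distribution flattens, which is why mid-range temperatures balance quality and diversity.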
Increase num_steps for higher-quality generation: More iterative refinement steps improve output quality at the cost of computation time. Use 8–16 steps for quick exploration, 32+ for final designs.
Batch sequences to maximize GPU utilization: Processing one sequence at a time underutilizes the GPU. When embedding many sequences, batch them (limited by VRAM).
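One simple batching scheme: sort sequences by length and pack them into chunks under a residue budget, so padding waste stays low. The bucketing helper below is illustrative (grouping logic only, not an esm API); each resulting batch would then be embedded in one forward pass:

```python
def bucket_by_length(sequences, max_residues_per_batch=4096):
    """Group sequences into batches whose padded size stays under a residue budget."""
    batches, current = [], []
    for seq in sorted(sequences, key=len):
        # Padded cost if seq joins the batch: batch size x longest member (this seq)
        if current and (len(current) + 1) * len(seq) > max_residues_per_batch:
            batches.append(current)
            current = []
        current.append(seq)
    if current:
        batches.append(current)
    return batches

seqs = ["M" * 50, "M" * 300, "M" * 60, "M" * 2000, "M" * 55]
batches = bucket_by_length(seqs, max_residues_per_batch=1000)
print([[len(s) for s in b] for b in batches])  # → [[50, 55, 60], [300], [2000]]
```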
Use Forge API for large-scale or large-model inference: The open-weight ESM3 is a smaller variant. For production-quality protein design, the Forge API provides access to larger models.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import torch
import numpy as np

model = ESMC.from_pretrained("esmc_300m")

# Truncated placeholders — replace with full sequences
sequences = {
    "Protein_A": "MKTAYIAKQRQISFVK...",
    "Protein_B": "MKWVTFISLLFLFSSAYS...",
    "Protein_C": "MSGLILQRAAVIAAGASSAG...",
}

# Extract embeddings
embs = {}
for name, seq in sequences.items():
    protein = ESMProtein(sequence=seq)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    embs[name] = output.embeddings[:, 1:-1, :].mean(dim=1).detach().squeeze()

# Compute similarity matrix
names = list(embs.keys())
sim_matrix = np.zeros((len(names), len(names)))
for i, n1 in enumerate(names):
    for j, n2 in enumerate(names):
        sim_matrix[i, j] = torch.cosine_similarity(embs[n1].unsqueeze(0), embs[n2].unsqueeze(0)).item()

print("Similarity matrix:")
for i, name in enumerate(names):
    print(f"  {name}: {sim_matrix[i].round(3)}")
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import numpy as np

model = ESMC.from_pretrained("esmc_600m")

# Generate and save (truncated placeholder — replace with a full sequence)
protein = ESMProtein(sequence="MKTAYIAKQRQISFVK...")
protein_tensor = model.encode(protein)
output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
np.save("embedding.npy", output.embeddings.detach().cpu().numpy())
print("Saved embedding.npy")
print("Saved embedding.npy")
# Load later (no GPU needed)
embedding = np.load("embedding.npy")
print(f"Loaded embedding: {embedding.shape}")
| Problem | Cause | Solution |
|---|---|---|
| CUDA out of memory | Model too large for GPU | Use a smaller model (esmc_300m), reduce batch size, or use the Forge cloud API |
| RuntimeError: no CUDA device | No GPU available | Models work on CPU (slower); set device="cpu" or use the Forge API |
| Slow generation | Too many num_steps or CPU inference | Reduce num_steps (8 for drafts), use a GPU, or use the Forge API for large models |
| ImportError: esm | Package not installed | pip install esm (note: this is EvolutionaryScale's esm, not the older Facebook Research esm) |
| Low-quality generated sequences | Temperature too high or too few steps | Lower temperature to 0.5, increase num_steps to 32+ |
| Forge API authentication error | Invalid or missing API token | Set the FORGE_API_TOKEN env var or pass token= explicitly; get a token from forge.evolutionaryscale.ai |
| KeyError loading model weights | Wrong model name | Use exact names: "esm3_sm_open_v1", "esmc_300m", "esmc_600m" |