Featurizes SMILES into fingerprints (ECFP, MACCS, MAP4), descriptors (RDKit, Mordred), pretrained embeddings (ChemBERTa, GIN, Graphormer) using Molfeat. Scikit-learn compatible for QSAR, virtual screening, similarity, molecular ML.
Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers under a scikit-learn compatible API. Convert SMILES strings into numerical representations (fingerprints, descriptors, deep learning embeddings) for QSAR modeling, virtual screening, similarity searching, and chemical space analysis.
Featurizes SMILES or RDKit molecules into ML vectors with 100+ methods (ECFP, MACCS, descriptors, ChemBERTa) for QSAR, virtual screening, and molecular ML tasks.
uv pip install molfeat
# Optional extras for specific featurizer types
uv pip install "molfeat[transformer]" # ChemBERTa, ChemGPT, MolT5
uv pip install "molfeat[dgl]" # GIN graph neural networks
uv pip install "molfeat[graphormer]" # Graphormer models
uv pip install "molfeat[fcd]" # FCD descriptors
uv pip install "molfeat[map4]" # MAP4 fingerprints
uv pip install "molfeat[all]" # All dependencies
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]
# Create fingerprint calculator + transformer
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch in parallel
features = transformer(smiles)
print(f"Shape: {features.shape}") # (4, 2048)
# Save configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")
Molfeat organizes featurization into three layers:
| Layer | Class | Purpose | Use When |
|---|---|---|---|
| Calculator | molfeat.calc.* | Single molecule → feature vector | Custom loops, single molecules |
| Transformer | molfeat.trans.MoleculeTransformer | Batch processing with parallelization | Datasets, scikit-learn pipelines |
| Store | molfeat.store.ModelStore | Discovery and loading of pretrained models | Finding available featurizers |
Calculators are callable: calc("CCO") returns a numpy array. Transformers wrap calculators for batch processing: transformer(smiles_list) returns a 2D array. Pretrained transformers (PretrainedMolTransformer) add batched GPU inference and caching.
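The three-layer composition can be sketched in plain Python. This is a simplified mock of the pattern, not molfeat's actual classes; the real implementations live in molfeat.calc and molfeat.trans and return numpy arrays:

```python
# Simplified mock of the molfeat layering (illustrative only).
class MockCalculator:
    """Calculator layer: callable, one molecule -> feature vector."""
    def __call__(self, smiles):
        # Toy 3-dim "features": string length, ring marker, oxygen count
        return [len(smiles), float("1" in smiles), smiles.count("O")]

class MockTransformer:
    """Transformer layer: wraps a calculator for batch processing."""
    def __init__(self, calc):
        self.calc = calc
    def __call__(self, smiles_list):
        # Real MoleculeTransformer parallelizes this loop and stacks
        # results into a 2D numpy array.
        return [self.calc(s) for s in smiles_list]

calc = MockCalculator()
transformer = MockTransformer(calc)
features = transformer(["CCO", "c1ccccc1"])
```

The point is the contract: any callable taking one molecule can be promoted to batch processing by the transformer layer, which is why custom calculators plug straight into MoleculeTransformer.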
| Task | Recommended | Dimensions | Speed |
|---|---|---|---|
| General QSAR | ecfp (radius=3) | 2048 | Fast |
| Scaffold similarity | maccs | 167 | Very fast |
| Large-scale screening | map4 | 1024 | Fast |
| Interpretable models | desc2D (RDKitDescriptors2D) | 200+ | Fast |
| Comprehensive descriptors | mordred | 1800+ | Medium |
| Transfer learning | ChemBERTa-77M-MLM | 768 | Slow* |
| Graph-based DL | gin-supervised-masking | Variable | Slow* |
| Pharmacophore | fcfp or cats2D | 2048 / 21 | Fast |
| 3D shape | usr / usrcat | 12 / 60 | Fast |
*First run slow; subsequent runs cached.
Save and reload exact featurizer configuration for reproducibility:
# Save
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Reload
loaded = MoleculeTransformer.from_state_yaml_file("config.yml")
from molfeat.calc import FPCalculator
# ECFP — most popular, general-purpose
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO")
print(f"ECFP shape: {fp.shape}") # (2048,)
# MACCS keys — 167-bit structural keys, fast scaffold similarity
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1")
print(f"MACCS shape: {fp.shape}") # (167,)
# Count-based fingerprints (non-binary)
ecfp_count = FPCalculator("ecfp-count", radius=3, fpSize=2048)
# MAP4 — MinHashed atom-pair, efficient for large databases
map4 = FPCalculator("map4")
print(f"MAP4 shape: {map4('CCO').shape}") # (1024,)
Available fingerprint types: ecfp, fcfp, maccs, rdkit, avalon, pattern, layered, atompair, topological, map4, secfp, erg, estate (and count variants with -count suffix).
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors
# RDKit 2D — 200+ named properties (MW, logP, TPSA, etc.)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"2D descriptors: {len(descriptors)}") # 200+
print(f"Feature names: {desc2d.columns[:5]}")
# Mordred — 1800+ comprehensive descriptors
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O")
print(f"Mordred descriptors: {len(descriptors)}") # 1800+
from molfeat.calc import CATSCalculator, USRDescriptors
# CATS — pharmacophore point pair distributions
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C")
print(f"CATS shape: {descriptors.shape}") # (21,)
# USR — ultrafast shape recognition
usr = USRDescriptors()
shape = usr("CC(=O)Oc1ccccc1C(=O)O")
print(f"USR shape: {shape.shape}") # (12,)
from molfeat.trans import MoleculeTransformer, FeatConcat
from molfeat.calc import FPCalculator
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O", "CCCC"]
# Parallel batch processing
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles)
print(f"Batch shape: {features.shape}") # (5, 2048)
# Concatenate multiple featurizers
concat = FeatConcat([
FPCalculator("maccs"), # 167 dims
FPCalculator("ecfp") # 2048 dims
])
combo_transformer = MoleculeTransformer(concat, n_jobs=-1)
combo_features = combo_transformer(smiles)
print(f"Combined shape: {combo_features.shape}") # (5, 2215)
# Error-tolerant processing
safe_transformer = MoleculeTransformer(
FPCalculator("ecfp"), n_jobs=-1,
ignore_errors=True, verbose=True
)
features = safe_transformer(["CCO", "invalid", "c1ccccc1"])
# Returns None for failed molecules
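A common follow-up when ignore_errors=True is filtering out the None rows while remembering which inputs survived; a plain-Python sketch, not a molfeat API:

```python
# Sketch: drop None rows from an error-tolerant featurization run,
# keeping the indices of molecules that featurized successfully so
# predictions can be mapped back to the original input list.
def filter_failed(smiles_list, features):
    kept_smiles, kept_features, kept_idx = [], [], []
    for i, (smi, feat) in enumerate(zip(smiles_list, features)):
        if feat is not None:
            kept_smiles.append(smi)
            kept_features.append(feat)
            kept_idx.append(i)
    return kept_smiles, kept_features, kept_idx

smiles = ["CCO", "invalid", "c1ccccc1"]
features = [[0.1, 0.2], None, [0.3, 0.4]]  # e.g. from safe_transformer
kept_smiles, kept_features, kept_idx = filter_failed(smiles, features)
```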
from molfeat.trans.pretrained import PretrainedMolTransformer
# ChemBERTa — RoBERTa trained on 77M PubChem compounds
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = chemberta(["CCO", "CC(=O)O", "c1ccccc1"])
print(f"ChemBERTa shape: {embeddings.shape}") # (3, 768)
# GIN — graph neural network pretrained on ChEMBL
gin = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
graph_emb = gin(["CCO", "CC(=O)O"])
print(f"GIN shape: {graph_emb.shape}")
from molfeat.store.modelstore import ModelStore
store = ModelStore()
print(f"Total available: {len(store.available_models)}")
# Search for specific model
results = store.search(name="ChemBERTa")
for model in results:
print(f" {model.name}: {model.description}")
# View usage and load
card = store.search(name="ChemBERTa-77M-MLM")[0]
card.usage()
transformer = store.load("ChemBERTa-77M-MLM")
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp", radius=3), n_jobs=-1)
X = transformer(smiles_train)
print(f"Features shape: {X.shape}")
# Train and evaluate
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} ± {scores.std():.3f}")
# Save for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
from sklearn.ensemble import RandomForestClassifier
# Step 1: Featurize known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
# Step 2: Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)
# Step 3: Screen library (e.g., 1M compounds)
X_screen = transformer(screening_smiles)
predictions = clf.predict_proba(X_screen)[:, 1]
# Step 4: Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_smiles[i] for i in top_indices]
print(f"Top 1000 hits selected from {len(screening_smiles)} compounds")
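For very large libraries, fully sorting the score array just to take the top hits is unnecessary; the stdlib heapq module can keep only the top k. A molfeat-independent sketch of the ranking step:

```python
import heapq

# Keep only the top-k (index, score) pairs instead of sorting the
# full score array -- O(n log k) rather than O(n log n), which
# matters when screening millions of compounds.
def top_k_hits(scores, k):
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [idx for idx, score in best]

scores = [0.2, 0.9, 0.1, 0.7, 0.5]
print(top_k_hits(scores, 3))  # -> [1, 3, 4], highest score first
```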
from molfeat.calc import FPCalculator, RDKitDescriptors2D
from sklearn.metrics import roc_auc_score
featurizers = {
'ECFP': FPCalculator("ecfp"),
'MACCS': FPCalculator("maccs"),
'Descriptors': RDKitDescriptors2D(),
}
for name, calc in featurizers.items():
transformer = MoleculeTransformer(calc, n_jobs=-1)
X_train = transformer(smiles_train)
X_test = transformer(smiles_test)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"{name}: AUC = {auc:.3f}")
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
('classifier', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
from sklearn.metrics.pairwise import cosine_similarity
calc = FPCalculator("ecfp")
query_fp = calc("CC(=O)Oc1ccccc1C(=O)O").reshape(1, -1) # Aspirin
transformer = MoleculeTransformer(calc, n_jobs=-1)
db_fps = transformer(database_smiles)
similarities = cosine_similarity(query_fp, db_fps)[0]
top_k = similarities.argsort()[-10:][::-1]
for i in top_k:
print(f" {database_smiles[i]}: {similarities[i]:.3f}")
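The recipe above uses cosine similarity; for binary fingerprints the conventional metric is Tanimoto similarity. A NumPy sketch, not part of molfeat:

```python
import numpy as np

# Tanimoto similarity between one binary query fingerprint and a
# matrix of database fingerprints (one per row).
def tanimoto(query, db):
    query = np.asarray(query, dtype=bool)
    db = np.asarray(db, dtype=bool)
    intersection = (db & query).sum(axis=1)
    union = (db | query).sum(axis=1)
    # Guard against all-zero fingerprints to avoid division by zero
    return intersection / np.maximum(union, 1)

query = np.array([1, 1, 0, 1])
db = np.array([[1, 1, 0, 1],   # identical -> 1.0
               [1, 0, 0, 1],   # 2 shared bits / 3 total -> 0.667
               [0, 0, 1, 0]])  # disjoint -> 0.0
print(tanimoto(query, db))
```

For count-based fingerprints or continuous embeddings, cosine similarity (as in the recipe) remains the appropriate choice.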
import numpy as np
def featurize_chunks(smiles_list, transformer, chunk_size=10000):
all_features = []
for i in range(0, len(smiles_list), chunk_size):
chunk = smiles_list[i:i+chunk_size]
features = transformer(chunk)
all_features.append(features)
print(f"Processed {min(i+chunk_size, len(smiles_list))}/{len(smiles_list)}")
return np.vstack(all_features)
| Parameter | Module | Default | Description |
|---|---|---|---|
| method | FPCalculator | — | Fingerprint type: ecfp, maccs, map4, etc. |
| radius | FPCalculator | 3 | Circular fingerprint radius |
| fpSize | FPCalculator | 2048 | Fingerprint bit length |
| counting | FPCalculator | False | Count vector instead of binary |
| n_jobs | MoleculeTransformer | 1 | Parallel workers (-1 = all cores) |
| ignore_errors | MoleculeTransformer | False | Skip invalid molecules (returns None) |
| verbose | MoleculeTransformer | False | Log processing details |
| dtype | MoleculeTransformer | float64 | Output type (float32 for memory) |
| mode | CATSCalculator | "2D" | Distance calculation mode |
| scale | CATSCalculator | "raw" | Scaling: raw, num, count |
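The dtype trade-off is easy to quantify; a plain NumPy illustration of the memory footprint of a 10,000-molecule, 2048-bit feature matrix at different precisions:

```python
import numpy as np

# Memory footprint of a 10,000 x 2048 feature matrix at different
# dtypes. Binary fingerprints can drop to uint8 (or be bit-packed).
n_mols, n_bits = 10_000, 2048
f64 = np.zeros((n_mols, n_bits), dtype=np.float64).nbytes / 1e6
f32 = np.zeros((n_mols, n_bits), dtype=np.float32).nbytes / 1e6
u8 = np.zeros((n_mols, n_bits), dtype=np.uint8).nbytes / 1e6
print(f"float64: {f64:.1f} MB, float32: {f32:.1f} MB, uint8: {u8:.1f} MB")
```

Halving precision halves memory, which compounds quickly at virtual-screening scale.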
- Use n_jobs=-1 for parallel processing on all CPU cores — significant speedup for batch featurization
- Use ignore_errors=True for large datasets — invalid SMILES won't crash the pipeline
- Use to_state_yaml_file() for reproducibility — recreate the exact featurizer later
- Use MoleculeTransformer(calc, dtype=np.float32) to reduce memory on large datasets
- Combine featurizers with FeatConcat to capture complementary molecular information

| Problem | Cause | Solution |
|---|---|---|
| ValueError: unsupported featurizer | Unknown method name | Check FPCalculator supported types or use ModelStore.search() |
| ImportError for pretrained model | Missing optional dependency | Install extras: pip install "molfeat[transformer]" or "molfeat[dgl]" |
| None in output array | Invalid SMILES with ignore_errors=True | Filter results: [f for f in features if f is not None] |
| Memory error on large dataset | Too many molecules at once | Process in chunks of 10K-50K (see Recipes) |
| Slow pretrained model inference | First run downloads model weights | Normal — subsequent runs use cache |
| Shape mismatch in pipeline | Mixed valid/invalid molecules | Ensure ignore_errors=True and filter None before ML model |
| Reproducibility issues | Different molfeat versions | Pin version and save config: transformer.to_state_yaml_file() |
- datamol-molecular-toolkit — High-level molecular manipulation (standardization, I/O, conformers)
- rdkit-molecular-toolkit — Low-level cheminformatics (substructure, reactions, 3D)
- scikit-learn — ML models consuming molfeat features

Main SKILL.md + 2 reference files. Original total: 1,273 lines (SKILL.md 510 + api_reference.md 429 + available_featurizers.md 334). Scripts: none. Examples: 724 lines (examples.md).
references/available_featurizers.md: Complete catalog of all 100+ featurizers organized by category — transformer models, GNNs, descriptors, fingerprints, pharmacophore, shape, scaffold, graph featurizers. Includes dimensions, dependencies, and selection guidance per category. Purely lookup-oriented content preserved as reference.
references/api_reference.md: Detailed API reference for molfeat.calc, molfeat.trans, and molfeat.store modules. Covers SerializableCalculator base class, all calculator subclasses with parameters, MoleculeTransformer methods, PretrainedMolTransformer, FeatConcat, ModelStore/ModelCard API, data type control, and PyTorch integration patterns.
Original file disposition:
- SKILL.md (510 lines) → Core API modules 1-6, Key Concepts (architecture, selection guide), Quick Start, Workflows 1-3. "Choosing the Right Featurizer" → Key Concepts selection guide table. "Advanced Features" (custom preprocessing, batch processing, caching) → Recipes + Best Practices. "Common Featurizers Reference" table → Key Concepts selection guide. "Performance Tips" → Best Practices. Per-use-case disposition: QSAR Modeling → Workflow 1, Virtual Screening → Workflow 2, Similarity Search → Recipe, Chemical Space → When to Use bullet, scikit-learn Pipeline → Recipe, Featurizer Comparison → Workflow 3
- references/api_reference.md (429 lines) → Migrated to new references/api_reference.md. Core patterns (FPCalculator, MoleculeTransformer, basic ModelStore) relocated to SKILL.md Core API modules 1-6. Detailed class methods, SerializableCalculator base class, PrecomputedMolTransformer, and PyTorch integration retained in reference
- references/available_featurizers.md (334 lines) → Migrated to new references/available_featurizers.md. Top-level summary → Key Concepts selection guide table. Full categorized catalog retained in reference
- references/examples.md (724 lines) → Fully consolidated inline: installation → Prerequisites; calculator examples → Core API 1-3; transformer examples → Core API 4; pretrained examples → Core API 5; ML integration → Workflows 1-3 + Recipes; advanced patterns (custom preprocessing, caching, chunk processing) → Recipes + Best Practices; troubleshooting → Troubleshooting table. No separate reference file needed — all content absorbed into SKILL.md sections

Retention: ~490 lines (SKILL.md) + ~170 lines (available_featurizers) + ~190 lines (api_reference) = ~850 / 1,273 original (excluding examples.md, treated as consolidated) = ~67%. Including examples.md in the denominator: ~850 / 1,997 = ~43%.