Loads PyTDC (Therapeutics Data Commons) datasets for drug discovery ML, including ADME, toxicity, DTI, and DDI, with scaffold splits, benchmarks, metrics, and molecular oracles.
Install the skill:

npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

This skill uses the workspace's default tool permissions.
PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery. It organizes therapeutics data into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions), and generation (molecule design, retrosynthesis). All datasets come with standardized splits, evaluation metrics, and molecular oracles.
uv pip install PyTDC
# Core deps: numpy, pandas, scikit-learn, tqdm, fuzzywuzzy
# Optional: rdkit (scaffold splits), torch-geometric (PyG conversion)
API Note: TDC downloads datasets on first access (~10-500 MB per dataset). Specify path='data/' to control download location. No API key required.
from tdc.single_pred import ADME
from tdc import Evaluator
# Load dataset with scaffold split
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42, frac=[0.7, 0.1, 0.2])
train, valid, test = split['train'], split['valid'], split['test']
print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")
# Train: ~640, Valid: ~91, Test: ~182
# Evaluate predictions
evaluator = Evaluator(name='MAE')
# score = evaluator(test['Y'].values, predictions)
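To close the loop on the quick start, here is a minimal baseline sketch: Morgan fingerprints via RDKit (an optional dependency listed above) fed to a scikit-learn random forest. The featurization and model are illustrative choices, not part of the TDC API.

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def featurize(smiles_list):
    """Morgan fingerprints (radius 2, 1024 bits) as a dense matrix; assumes valid SMILES."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fps.append(list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)))
    return np.array(fps)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(featurize(train['Drug']), train['Y'])
preds = model.predict(featurize(test['Drug']))
print(f"Scaffold-split test MAE: {evaluator(test['Y'].values, preds):.4f}")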
Load datasets for predicting properties of individual molecules or proteins.
from tdc.single_pred import ADME, Tox, HTS, QM
# ADME — pharmacokinetic properties
data = ADME(name='Caco2_Wang') # Intestinal permeability (regression)
data = ADME(name='BBB_Martins') # Blood-brain barrier (binary)
data = ADME(name='Lipophilicity_AstraZeneca') # LogD (regression)
data = ADME(name='Solubility_AqSolDB') # Aqueous solubility
# Toxicity — adverse effects
data = Tox(name='hERG') # Cardiotoxicity (binary)
data = Tox(name='AMES') # Mutagenicity (binary)
data = Tox(name='DILI') # Drug-induced liver injury
data = Tox(name='ClinTox') # Clinical trial toxicity
# Access data as DataFrame
df = data.get_data(format='df')
print(df.columns.tolist())
# ['Drug_ID', 'Drug', 'Y'] — Drug is SMILES, Y is target label
print(f"Dataset size: {len(df)}, Label range: [{df['Y'].min():.2f}, {df['Y'].max():.2f}]")
Other single-prediction tasks: HTS (screening), QM (quantum mechanics), Yields, Epitope, Develop, CRISPROutcome.
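Dataset names for these tasks can be discovered the same way as in Recipe 1 below; a short sketch, where the HTS dataset name 'HIV' is an assumption to verify against the printed list:

from tdc.single_pred import HTS
from tdc.utils import retrieve_dataset_names

print(retrieve_dataset_names('HTS'))  # list valid HTS dataset names first
data = HTS(name='HIV')  # assumed name; confirm it appears in the list above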
Load datasets for predicting interactions between pairs of biomedical entities.
from tdc.multi_pred import DTI, DDI, PPI
# Drug-Target Interaction — binding affinity
data = DTI(name='BindingDB_Kd') # 52,284 pairs, Kd values
data = DTI(name='DAVIS') # 30,056 pairs, kinase binding
data = DTI(name='KIBA') # 118,254 pairs, kinase bioactivity
# Drug-Drug Interaction — interaction type prediction
data = DDI(name='DrugBank') # 191,808 pairs, 86 interaction types
# Protein-Protein Interaction
data = PPI(name='HuRI')
# Multi-instance data format
df = data.get_data(format='df')
print(df.columns.tolist())
# ['Drug_ID', 'Drug', 'Target_ID', 'Target', 'Y']
# Drug=SMILES, Target=protein sequence, Y=binding affinity or class
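Raw BindingDB Kd values span orders of magnitude and contain duplicate measurements; the TDC docs describe harmonize_affinities and convert_to_log helpers on the DTI class for exactly this. A sketch, with method names taken from those docs (verify on your installed version):

data = DTI(name='BindingDB_Kd')
data.harmonize_affinities(mode='max_affinity')  # deduplicate drug-target pairs
data.convert_to_log(form='binding')             # nM affinities to log scale
df = data.get_data(format='df')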
Other multi-instance tasks: GDA, DrugRes, DrugSyn, PeptideMHC, AntibodyAff, MTI, Catalyst, TrialOutcome.
Load training sets and oracles for molecule generation and retrosynthesis.
from tdc.generation import MolGen, RetroSyn, PairMolGen
from tdc import Oracle
# Molecule generation — training data
data = MolGen(name='ChEMBL_V29') # 1.6M drug-like SMILES
split = data.get_split()
train_smiles = split['train']['Drug'].tolist()
# Oracle scoring — evaluate generated molecules
oracle = Oracle(name='GSK3B') # GSK3B inhibition predictor (0-1)
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
print(f"GSK3B score: {score:.4f}")
# Batch evaluation
scores = oracle(['CCO', 'c1ccccc1', 'CC(=O)O'])
print(f"Batch scores: {scores}")
# Retrosynthesis — reaction prediction
data = RetroSyn(name='USPTO') # 1.9M reactions
split = data.get_split()
# Paired generation — prodrug design
data = PairMolGen(name='Prodrug')
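A typical generation loop scores candidates with an oracle and keeps the best; a minimal sketch with hand-picked SMILES standing in for model output:

from tdc import Oracle

oracle = Oracle(name='QED')
candidates = ['CCO', 'c1ccccc1O', 'CC(=O)Nc1ccc(O)cc1']  # stand-ins for generated molecules
scores = oracle(candidates)
for smi, s in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:2]:
    print(f"{smi}: QED = {s:.3f}")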
Apply meaningful data splits and standardized evaluation metrics.
from tdc.single_pred import ADME
from tdc.multi_pred import DTI
from tdc import Evaluator
# Scaffold split — ensures chemical diversity between sets
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42, frac=[0.7, 0.1, 0.2])
# Cold splits — for DTI (unseen drugs/targets in test set)
data = DTI(name='BindingDB_Kd')
cold_drug = data.get_split(method='cold_drug', seed=1)
cold_target = data.get_split(method='cold_target', seed=1)
# Verify no overlap in cold split
train_drugs = set(cold_drug['train']['Drug_ID'])
test_drugs = set(cold_drug['test']['Drug_ID'])
print(f"Drug overlap: {len(train_drugs & test_drugs)}") # 0
# Evaluation metrics
eval_mae = Evaluator(name='MAE')
eval_auc = Evaluator(name='ROC-AUC')
eval_spearman = Evaluator(name='Spearman')
# score = eval_mae(y_true, y_pred)
Available split methods: random, scaffold (Bemis-Murcko), cold_drug, cold_target, cold_drug_target, temporal.
Available metrics: Classification — ROC-AUC, PR-AUC, F1, Accuracy, Kappa. Regression — RMSE, MAE, R2, MSE. Ranking — Spearman, Pearson. Multi-label — Micro-F1, Macro-F1.
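Every Evaluator uses the same call convention: ground truth first, then predictions (predicted probabilities for threshold-free metrics like ROC-AUC). A quick sanity check on synthetic labels:

import numpy as np
from tdc import Evaluator

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probabilities
auc = Evaluator(name='ROC-AUC')(y_true, y_score)
print(f"ROC-AUC: {auc:.2f}")  # 0.75 on this toy example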
Run standardized multi-seed evaluation protocols for model comparison.
from tdc.benchmark_group import admet_group
# Load ADMET benchmark (22 datasets)
group = admet_group(path='data/')
# Standard 5-seed evaluation protocol
predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # Train your model on train, tune on valid
    # predictions = {name: model.predict(test['Drug'])}
    predictions = {name: test['Y'].values}  # placeholder
    predictions_list.append(predictions)
# Aggregate across seeds: {dataset: [mean, std]}
results = group.evaluate_many(predictions_list)
mean_mae, std_mae = results['caco2_wang']
print(f"Mean MAE: {mean_mae:.4f} ± {std_mae:.4f}")
| Category | Import Path | Task Examples | Data Format |
|---|---|---|---|
| Single-Instance | tdc.single_pred | ADME, Tox, HTS, QM | Drug (SMILES) + Y (label) |
| Multi-Instance | tdc.multi_pred | DTI, DDI, PPI, DrugSyn | Drug + Target + Y |
| Generation | tdc.generation | MolGen, RetroSyn | SMILES collections |
| Benchmark | tdc.benchmark_group | admet_group | Curated splits |
| Category | Examples | Speed | Output Range |
|---|---|---|---|
| Biochemical | DRD2, GSK3B, JNK3, 5HT2A | Medium (ML) | 0-1 probability |
| Physicochemical | QED, SA, LogP, MW | Fast (rule-based) | Varies by metric |
| Composite | Isomer_Meta, Median1/2, Rediscovery | Medium | 0-1 combined |
| Specialized | ASKCOS, Docking, Vina | Slow (external) | Varies |
| Utility | Function | Example |
|---|---|---|
| Format conversion | MolConvert(src, dst) | SMILES → PyG, ECFP, SELFIES, DGL |
| Molecule filters | MolFilter(filters) | PAINS, BMS, Glaxo, drug-likeness |
| Label binarization | label_transform() | Continuous → binary at threshold |
| Unit conversion | label_transform(from_unit, to_unit) | nM → pIC50 |
| ID resolution | cid2smiles(), uniprot2seq() | PubChem CID → SMILES |
| Dataset listing | retrieve_dataset_names(task) | List all ADME datasets |
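A sketch of the filter and ID-resolution rows above; the MolFilter keyword arguments (a filters list plus optional property ranges such as HBD) follow the TDC chem_utils docs and should be verified on your version:

from tdc.chem_utils import MolFilter
from tdc.utils import cid2smiles

mol_filter = MolFilter(filters=['PAINS'], HBD=[0, 6])
clean = mol_filter(['CC(C)Cc1ccc(cc1)C(C)C(O)=O', 'CCO'])  # keeps molecules passing all filters
print(f"Passed filters: {clean}")
print(cid2smiles(2244))  # PubChem CID 2244 is aspirin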
from tdc.single_pred import ADME
from tdc import Evaluator
import numpy as np
data = ADME(name='Caco2_Wang')
evaluator = Evaluator(name='MAE')
results = []
for seed in [1, 2, 3, 4, 5]:
    split = data.get_split(method='scaffold', seed=seed)
    train, valid, test = split['train'], split['valid'], split['test']
    # model.fit(train['Drug'], train['Y'])
    # preds = model.predict(test['Drug'])
    preds = test['Y'].values + np.random.normal(0, 0.1, len(test))  # placeholder
    score = evaluator(test['Y'].values, preds)
    results.append(score)
    print(f"Seed {seed}: MAE = {score:.4f}")
print(f"Mean MAE: {np.mean(results):.4f} ± {np.std(results):.4f}")
from tdc import Oracle
# Define multi-objective scoring
oracles = {
    'QED': (Oracle(name='QED'), 0.3),      # drug-likeness
    'SA': (Oracle(name='SA'), 0.3),        # synthetic accessibility
    'GSK3B': (Oracle(name='GSK3B'), 0.4),  # target activity
}
test_smiles = ['CC(C)Cc1ccc(cc1)C(C)C(O)=O', 'c1ccc2c(c1)cc1ccc3cccc4ccc2c1c34']
for smi in test_smiles:
    scores = {}
    weighted_sum = 0
    for name, (oracle, weight) in oracles.items():
        score = oracle(smi)
        scores[name] = score
        weighted_sum += score * weight
    print(f"SMILES: {smi[:30]}...")
    print(f"  Scores: {scores}")
    print(f"  Weighted: {weighted_sum:.4f}")
A third workflow, cold-split DTI evaluation, composes DTI(name='BindingDB_Kd') (Core API Module 2), data.get_split(method='cold_drug', seed=seed) (Core API Module 4), and Evaluator(name='Spearman') (Module 4); a sketch appears right after the parameters table below.

| Parameter | Function/Module | Default | Range/Options | Effect |
|---|---|---|---|---|
| method | get_split() | 'random' | random, scaffold, cold_drug, cold_target, temporal | Split strategy for train/test |
| seed | get_split() | 42 | 1-5 for benchmarks | Reproducibility; use 5 seeds for benchmarks |
| frac | get_split() | [0.7, 0.1, 0.2] | Must sum to 1.0 | Train/valid/test proportions |
| name | Evaluator() | — | MAE, RMSE, ROC-AUC, Spearman, etc. | Evaluation metric |
| name | Oracle() | — | QED, SA, GSK3B, DRD2, etc. | Scoring function for molecules |
| src/dst | MolConvert() | — | SMILES, SELFIES, PyG, DGL, ECFP4 | Molecular representation formats |
| path | admet_group() | 'data/' | Any directory | Dataset download/cache location |
| format | get_data() | 'df' | 'df', 'dict' | Output data format |
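The cold-split DTI workflow referenced above, as a minimal sketch; the random predictions are placeholders for a real DTI model:

from tdc.multi_pred import DTI
from tdc import Evaluator
import numpy as np

data = DTI(name='BindingDB_Kd')
evaluator = Evaluator(name='Spearman')
results = []
for seed in [1, 2, 3, 4, 5]:
    split = data.get_split(method='cold_drug', seed=seed)
    train, test = split['train'], split['test']
    # Train a DTI model on train here
    preds = np.random.rand(len(test))  # placeholder predictions
    results.append(evaluator(test['Y'].values, preds))
print(f"Cold-drug Spearman: {np.mean(results):.4f} ± {np.std(results):.4f}")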
- cold_drug tests generalization to unseen drugs, cold_target to unseen targets
- Apply MolFilter (PAINS, drug-likeness) before training to remove problematic compounds
- Set a persistent path to avoid re-downloading large datasets across sessions

from tdc.single_pred import ADME
from tdc.utils import retrieve_dataset_names
# List all ADME datasets
datasets = retrieve_dataset_names('ADME')
print(f"Available ADME datasets: {datasets}")
# Load and inspect
data = ADME(name='Caco2_Wang')
df = data.get_data(format='df')
print(f"Size: {len(df)}")
print(f"Label stats: mean={df['Y'].mean():.2f}, std={df['Y'].std():.2f}")
print(f"SMILES example: {df['Drug'].iloc[0]}")
from tdc.chem_utils import MolConvert
# SMILES to multiple representations
smiles = 'CC(C)Cc1ccc(cc1)C(C)C(O)=O'
converter_ecfp = MolConvert(src='SMILES', dst='ECFP4')
converter_selfies = MolConvert(src='SMILES', dst='SELFIES')
ecfp = converter_ecfp(smiles)
selfies = converter_selfies(smiles)
print(f"ECFP4 shape: {ecfp.shape}") # (1024,) binary fingerprint
print(f"SELFIES: {selfies}")
from tdc import Oracle
# Define property constraints
constraints = {
    'QED': (Oracle(name='QED'), 0.5, 1.0),   # min, max
    'SA': (Oracle(name='SA'), 1.0, 4.0),
    'LogP': (Oracle(name='LogP'), -0.5, 5.0),
}
def check_constraints(smiles):
    """Check if molecule satisfies all property constraints."""
    results = {}
    all_pass = True
    for name, (oracle, lo, hi) in constraints.items():
        score = oracle(smiles)
        passed = lo <= score <= hi
        results[name] = {'score': score, 'passed': passed}
        all_pass = all_pass and passed
    return all_pass, results

passed, details = check_constraints('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
for name, info in details.items():
    status = "PASS" if info['passed'] else "FAIL"
    print(f"{name}: {info['score']:.3f} [{status}]")
| Problem | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: tdc | Package not installed | uv pip install PyTDC |
| Scaffold split fails | Missing RDKit dependency | uv pip install rdkit for scaffold decomposition |
| Dataset download timeout | Large dataset or slow connection | Set path='data/' for persistent cache; retry |
| KeyError on dataset name | Wrong name or task category | Use retrieve_dataset_names('ADME') to list valid names |
| Oracle returns NaN | Invalid SMILES or RDKit parse failure | Validate SMILES with RDKit MolFromSmiles() first |
| Cold split empty test set | Too few unique entities | Use frac=[0.7, 0.1, 0.2] with larger datasets |
| Benchmark evaluation error | Wrong prediction format | Pass dict with seeds as keys: {1: preds1, 2: preds2, ...} |
| Memory error on large dataset | Full dataset loaded to memory | Process in chunks or use smaller split fractions |
| PyG conversion fails | torch-geometric not installed | uv pip install torch-geometric for graph conversion |
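For the oracle-NaN row above, a standard RDKit pre-check drops unparseable SMILES before scoring; a minimal sketch:

from rdkit import Chem

smiles_list = ['CCO', 'not_a_smiles', 'c1ccccc1']
valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
print(f"Kept {len(valid)}/{len(smiles_list)} parseable SMILES")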
Covers: complete catalog of all TDC datasets organized by task category (single-instance, multi-instance, generation) with dataset names, sizes, label types, and data sources. Relocated inline: top ADME/Tox/DTI datasets with code examples consolidated into Core API Modules 1-2. Omitted: None — all dataset entries preserved in catalog format.
Covers: detailed oracle documentation (all 17+ oracles with parameters, speed tiers, output ranges, custom oracle template) and data processing utilities (format conversion targets, molecule filter types, label transformation, entity resolution). Consolidated from original oracles.md + utilities.md. Relocated inline: Quick Start oracle usage pattern, top evaluation metrics, split methods, MolConvert pattern → Core API Modules 3-4. Omitted: distribution learning KS-test example — niche statistical comparison; leaderboard submission guide — platform-specific.
- load_and_split_data.py (215 lines): scaffold/cold split patterns → Core API Module 4; custom split fractions → Key Parameters; evaluation examples → Workflow 1. Thin wrappers around get_split() and Evaluator().
- benchmark_evaluation.py (328 lines): 5-seed protocol → Core API Module 5 + Workflow 1; multi-dataset evaluation → Workflow 1 pattern; leaderboard guide → omitted (platform-specific).
- molecular_generation.py (405 lines): single/batch oracle usage → Core API Module 3; multi-objective scoring → Workflow 2; constraint satisfaction → Recipe 3; distribution learning → omitted (niche).