Processes CSV, TSV, TXT, SDF molecule files via ChemAudit batch pipeline with Redis progress tracking, on-demand analytics (scaffold, clustering, MMP, R-group), and nine export formats.
File-in, file-out bulk validation with three phases:
1. Upload the file → job_id response.
2. Progress via WebSocket or /status polling.
3. Paginated results, on-demand analytics, and exports.

All state lives in Redis keyed by job_id. Session cookies scope access (one owner per job).
curl -sS http://localhost:8000/api/v1/config
Returns {"limits": {"max_batch_size": N, "max_file_size_mb": M}, "deployment_profile": "medium"}. Profiles (small/medium/large/xl/coconut) set these via config/*.yml.
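Those limits are worth checking client-side before a large upload. A minimal pre-check sketch, assuming only the response shape shown above (the sample limit values are hypothetical; read the real ones from /api/v1/config):

```python
def within_limits(limits: dict, file_size_mb: float, n_molecules: int) -> list:
    """Return a list of limit violations; an empty list means safe to upload."""
    problems = []
    if n_molecules > limits["max_batch_size"]:
        problems.append(f"{n_molecules} molecules > max_batch_size={limits['max_batch_size']}")
    if file_size_mb > limits["max_file_size_mb"]:
        problems.append(f"{file_size_mb} MB > max_file_size_mb={limits['max_file_size_mb']}")
    return problems

# Hypothetical "medium" profile values -- fetch the real dict from /api/v1/config.
limits = {"max_batch_size": 1000, "max_file_size_mb": 25}
print(within_limits(limits, file_size_mb=4.2, n_molecules=450))   # []
```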
Before upload, probe a CSV for SMILES and Name columns:
curl -sS -X POST http://localhost:8000/api/v1/batch/detect-columns \
-F "file=@compounds.csv"
Returns {columns, suggested_smiles, suggested_name, column_samples, row_count_estimate, file_size_mb}.
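The suggestions can be fed straight into the upload form. A sketch that prefers the server's suggestion and otherwise falls back to common header names (the fallback names here are assumptions, not part of the API):

```python
def choose_columns(detect: dict) -> tuple:
    """Pick (smiles_column, name_column) from a detect-columns response,
    falling back to common header names when no suggestion is returned."""
    cols = detect.get("columns", [])
    smiles = detect.get("suggested_smiles") or next(
        (c for c in cols if c.lower() in ("smiles", "canonical_smiles")), None)
    name = detect.get("suggested_name") or next(
        (c for c in cols if c.lower() in ("name", "id", "compound_id")), None)
    if smiles is None:
        raise ValueError("no SMILES-like column found; pass smiles_column explicitly")
    return smiles, name
```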
curl -sS -X POST http://localhost:8000/api/v1/batch/upload \
-F "file=@compounds.csv" \
-F "smiles_column=SMILES" \
-F "name_column=Name" \
-F "include_extended_safety=false" \
-F "include_chembl_alerts=false" \
-F "include_standardization=true" \
-F "include_profiling=false" \
-F "include_safety_assessment=false" \
-F "profile_id=" \
-F "notification_email="
Accepts .sdf, .csv, .tsv, .txt. Response:
{
  "job_id": "9e3b2c1e-....",
  "status": "pending",
  "total_molecules": 450,
  "message": "Job submitted. Processing 450 molecules."
}
Form fields:
smiles_column — CSV only; defaults to SMILES.
name_column — CSV only; optional.
include_extended_safety — add NIH and ZINC catalogs.
include_chembl_alerts — add 7 ChEMBL pharma filter sets (BMS, Dundee, Glaxo, Inpharmatica, LINT, MLSMR, SureChEMBL).
include_standardization — run ChEMBL standardization pipeline per molecule.
include_profiling — PFI, #stars, Abbott bioavailability, CNS MPO.
include_safety_assessment — CYP, hERG, bRo5, REOS, complexity panel.
profile_id — scoring profile ID; adds profile desirability score per molecule.
notification_email — sends a completion email (validated server-side).

WebSocket (preferred):
const ws = new WebSocket(`ws://localhost:8000/ws/batch/${job_id}`);
ws.onmessage = (e) => console.log(JSON.parse(e.data));
// Send "ping" to keep alive; server replies "pong".
Messages: {job_id, status, progress, processed, total, eta_seconds}.
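Those fields map cleanly onto a one-line progress display. A small formatting sketch, assuming only the field names listed above (treating eta_seconds as possibly absent early in a job is an assumption here):

```python
def format_progress(msg: dict) -> str:
    """Render one WebSocket/status payload as a single progress line."""
    pct = 100.0 * msg["processed"] / max(msg["total"], 1)
    eta = msg.get("eta_seconds")
    eta_txt = ""
    if eta is not None:
        m, s = divmod(int(eta), 60)
        eta_txt = f", ETA {m}m{s:02d}s"
    return f"[{msg['status']}] {msg['processed']}/{msg['total']} ({pct:.0f}%){eta_txt}"

print(format_progress({"job_id": "9e3b", "status": "processing", "progress": 0.5,
                       "processed": 225, "total": 450, "eta_seconds": 90}))
# [processing] 225/450 (50%), ETA 1m30s
```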
Polling:
curl -sS http://localhost:8000/api/v1/batch/<job_id>/status
curl -sS "http://localhost:8000/api/v1/batch/<job_id>?page=1&page_size=50&status_filter=error&min_score=0&max_score=40"
Filters: status_filter (success/error), min_score/max_score (0–100), sort_by (index/name/smiles/score/qed/safety/status/issues), sort_dir (asc/desc), issue_filter (failed check name), alert_filter (catalog name).
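When WebSocket is unavailable, a status-polling loop needs only a deadline and a terminal-state check. A sketch with an injectable fetch() callable so it stays easy to test; treating "error" as a terminal status alongside "complete" is an assumption here:

```python
import time

def poll_until_done(fetch, interval=2.0, timeout=600.0):
    """Call fetch() (which should GET /batch/<job_id>/status and return the
    parsed JSON) until the job reaches a terminal status; raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch()
        if payload.get("status") in ("complete", "error"):
            return payload
        time.sleep(interval)
    raise TimeoutError("batch job did not reach a terminal status in time")
```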
Stats-only endpoint (cheap, no results array):
curl -sS http://localhost:8000/api/v1/batch/<job_id>/stats
Get current analytics status and any cached results:
curl -sS http://localhost:8000/api/v1/batch/<job_id>/analytics
Trigger a specific computation:
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/scaffold
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/chemical_space \
-H 'Content-Type: application/json' -d '{"method": "tsne"}'
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/clustering \
-H 'Content-Type: application/json' -d '{"cutoff": 0.35}'
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/mmp \
-H 'Content-Type: application/json' -d '{"activity_column": "pIC50"}'
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/similarity_search \
-H 'Content-Type: application/json' -d '{"query_smiles": "CCO", "top_k": 20}'
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/rgroup \
-H 'Content-Type: application/json' -d '{"core_smarts": "c1ccccc1"}'
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/analytics/taxonomy
Supported analysis types: scaffold, chemical_space, mmp, similarity_search, rgroup, clustering, taxonomy.
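The seven trigger endpoints differ only in the path suffix and JSON body, so a thin dispatcher covers them all. A sketch using only the routes documented above; urllib stands in for whatever HTTP client you prefer:

```python
import json
from urllib import request

ANALYSIS_TYPES = {"scaffold", "chemical_space", "mmp", "similarity_search",
                  "rgroup", "clustering", "taxonomy"}

def trigger_analytics(base: str, job_id: str, kind: str, params: dict = None):
    """POST /api/v1/batch/<job_id>/analytics/<kind> with an optional JSON body."""
    if kind not in ANALYSIS_TYPES:
        raise ValueError(f"unknown analysis type: {kind!r}")
    req = request.Request(
        f"{base}/api/v1/batch/{job_id}/analytics/{kind}",
        data=json.dumps(params or {}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# e.g. trigger_analytics("http://localhost:8000", job_id, "clustering", {"cutoff": 0.35})
```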
Clustering cap: Butina clustering is hard-capped at 1,000 molecules (D-06). Subsample or filter first on larger jobs.
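One way to stay under that cap is to subsample indices client-side and run the subset endpoints below on the sample. A sketch, seeded for reproducibility (the 1,000 default matches the documented cap):

```python
import random

def subsample_indices(n_total: int, cap: int = 1000, seed: int = 0) -> list:
    """Return at most `cap` sorted molecule indices, sampled uniformly,
    suitable for /subset/revalidate before triggering clustering."""
    if n_total <= cap:
        return list(range(n_total))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), cap))

print(len(subsample_indices(5000)))   # 1000
```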
Triggering an analysis that's already cached returns already_complete unless params differ (e.g. different method for chemical_space or different cutoff for clustering — those trigger recompute).
Between any two molecules in the batch by index:
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/mcs \
-H 'Content-Type: application/json' \
-d '{"index_a": 0, "index_b": 17}'
Returns MCS SMARTS, Tanimoto similarity, matched atom/bond counts, property deltas.
Apply an operation to a hand-picked subset from the UI or another filter pass:
# Revalidate — creates new batch with the chosen indices
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/subset/revalidate \
-H 'Content-Type: application/json' \
-d '{"indices": [0, 3, 17, 42]}'
# Rescore with a profile — creates new batch or scores inline
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/subset/rescore \
-H 'Content-Type: application/json' \
-d '{"indices": [0, 3, 17], "profile_id": 4}'
# Inline rescore (no Celery, sync response)
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/subset/score-inline \
-H 'Content-Type: application/json' \
-d '{"indices": [0, 3, 17], "profile_id": 4}'
# Export just the subset
curl -sS -X POST http://localhost:8000/api/v1/batch/<job_id>/subset/export \
-H 'Content-Type: application/json' \
-d '{"indices": [0, 3, 17], "format": "sdf"}' -o subset.sdf
curl -sS "http://localhost:8000/api/v1/batch/<job_id>/export?format=excel&include_images=true&sheet_layout=multi" \
-o results.xlsx
curl -sS "http://localhost:8000/api/v1/batch/<job_id>/export?format=pdf&include_audit=true&sections=validation_summary,score_distribution" \
-o report.pdf
Formats (details in references/export-formats.md):
| Format | Content |
|---|---|
| csv | All validation, scoring, and alert columns |
| excel | Multi-sheet XLSX, optional 2D images, conditional coloring |
| sdf | MOL blocks with properties; include_audit=true for full audit |
| json | Full result objects with nested metadata |
| pdf | Professional report; charts, stats, images; section-selectable |
| fingerprint | ZIP with Morgan/MACCS/RDKit fingerprints as CSV, .npy, .npz |
| dedup | ZIP with deduplication summary and per-group annotated CSVs |
| scaffold | CSV with Murcko scaffold SMILES and scaffold-group assignment |
| property_matrix | ZIP with flat CSV + multi-sheet Excel of all properties |
Filters: score_min, score_max, status (success/error/warning), indices=0,1,5,23 (GET) or POST body {"indices": [...]} for large selections.
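For scripted exports the query string is the only moving part. A small URL-builder sketch, covering GET-style filters only (large index selections should go in a POST body, per the note above):

```python
from urllib.parse import urlencode

def export_url(base: str, job_id: str, fmt: str, **filters) -> str:
    """Build a GET export URL; None-valued filters are dropped."""
    params = {"format": fmt}
    params.update({k: v for k, v in filters.items() if v is not None})
    return f"{base}/api/v1/batch/{job_id}/export?{urlencode(params)}"

print(export_url("http://localhost:8000", "9e3b", "csv", score_min=40, status="success"))
# http://localhost:8000/api/v1/batch/9e3b/export?format=csv&score_min=40&status=success
```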
curl -sS -X DELETE http://localhost:8000/api/v1/batch/<job_id>
Already-completed results are retained.
Validate, then export only the failures:
1. POST /api/v1/batch/upload with the file → capture job_id.
2. GET /api/v1/batch/<job_id>/status until status == "complete".
3. GET /api/v1/batch/<job_id>/export?format=sdf&status=error → save to failures.sdf.

Scaffold analysis:
1. POST /api/v1/batch/<job_id>/analytics/scaffold.
2. GET /api/v1/batch/<job_id>/analytics until the scaffold key moves to status: complete.
3. Read scaffold.groups[] and render a Murcko-scaffold frequency chart.

Clustering:
1. POST /api/v1/batch/<job_id>/analytics/clustering with {"cutoff": 0.30} (distance, not similarity — 0.30 distance ≈ 0.70 similarity).
2. Read clustering.clusters[]; each contains centroid_index.

Rate limits:
/batch/upload: 3/minute.
/batch/detect-columns: 10/minute.
/batch/<job_id> (paginated results): 60/minute.
/batch/<job_id>/status, /stats, DELETE /batch/<job_id>: 10/minute.
/batch/<job_id>/analytics (status poll): 120/minute.
/batch/<job_id>/analytics/<type> (trigger): 10/minute.
/batch/<job_id>/mcs: 30/minute.
/batch/<job_id>/subset/revalidate, /subset/rescore, /subset/export: 10/minute.
/batch/<job_id>/subset/score-inline: 30/minute.
/batch/<job_id>/export (GET and POST): 30/minute.
An API key via X-API-Key raises the per-minute tier.
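A batch script that walks several endpoints can stay under these tiers with a small client-side throttle. A rolling-window limiter sketch; clock and sleep are injectable so the logic can be exercised offline:

```python
import time

class RateLimiter:
    """Block so at most `per_minute` calls start in any rolling 60 s window."""
    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.per_minute = per_minute
        self.clock = clock
        self.sleep = sleep
        self.calls = []

    def acquire(self):
        now = self.clock()
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.per_minute:
            # Wait until the oldest call in the window ages out.
            self.sleep(60 - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < 60]
        self.calls.append(now)

# e.g. RateLimiter(3).acquire() before each POST /batch/upload
```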
Check /api/v1/config — deployment limits are profile-driven. Split the file or redeploy with a larger profile (./deploy.sh large).
CSV column wrong, or every row has a parse error. Run /batch/detect-columns first, or pre-validate with /diagnostics/file-prevalidate (see chemaudit-diagnostics).
Malicious-content scan matched a pattern (script tags, macros). Legitimate hits are rare — audit-logged server-side. Sanitize the source file.
Pattern-validated; must be a normal user@domain.tld under 254 chars.
Results on /batch/<job_id> disappear a few minutes after completion: the results TTL in Redis is BATCH_RESULT_TTL (default 1 hour). Download exports before expiry or increase the TTL.
D-06 hard cap. Filter (by score, alert, or scaffold) before triggering clustering.
WebSocket connections are capped per job (_WS_MAX_PER_JOB = 10).

Analytics stuck in computing: the Celery worker crashed mid-task. Trigger the same analysis again — the idempotency guard resets computing to queued when params change; otherwise flush the status key manually.
references/export-formats.md — exhaustive spec of all nine export formats.
chemaudit-qsar-ready — for ML-dataset curation pipelines on the same file input.
chemaudit-structure-filter — for generative-chemistry funnel filtering on a SMILES list.
chemaudit-dataset-intelligence — for dataset health auditing with activity columns.