From chemaudit
Diagnoses SMILES parse errors with position/fix suggestions, compares InChI layers, checks round-trip lossiness via InChI/MOL, compares standardization pipelines, pre-validates SDF/CSV files.
Five targeted diagnostic endpoints. Each answers a different "why didn't this work?" question.
| Endpoint | Answers |
|---|---|
| `POST /api/v1/diagnostics/smiles` | "Why does this SMILES fail to parse?" |
| `POST /api/v1/diagnostics/inchi-diff` | "How do these two InChI strings differ?" |
| `POST /api/v1/diagnostics/roundtrip` | "Does this SMILES survive a round-trip through InChI (or MOL)?" |
| `POST /api/v1/diagnostics/cross-pipeline` | "Do three different standardization pipelines agree?" |
| `POST /api/v1/diagnostics/file-prevalidate` | "Is this SDF/CSV structurally sound before batch upload?" |
`POST /api/v1/diagnostics/smiles` uses a dual strategy: RDKit's `DetectChemistryProblems` for parseable-but-problematic SMILES, and log capture for unparseable SMILES. It returns the error type, the character position, and ranked fix suggestions.
```bash
curl -sS -X POST http://localhost:8000/api/v1/diagnostics/smiles \
  -H 'Content-Type: application/json' \
  -d '{"smiles": "C(C)(C)(C)(C)C"}'
```

Response:

```json
{
  "valid": false,
  "canonical_smiles": null,
  "warnings": [],
  "errors": [
    {
      "raw_message": "Explicit valence for atom ... is 5, is greater than permitted",
      "position": 0,
      "error_type": "valence_error",
      "message": "Carbon at position 0 has 5 bonds (max: 4)",
      "suggestions": [
        {
          "description": "Remove one neighbor branch",
          "corrected_smiles": "C(C)(C)(C)C",
          "confidence": 0.9
        }
      ]
    }
  ]
}
```
Error types returned: `unmatched_bracket`, `valence_error`, `ring_closure_mismatch`, `unknown_atom_symbol`, `invalid_charge`, `parse_error`.

For valid-but-suspicious SMILES, the response has `valid: true` and a populated `warnings[]` (RDKit chemistry warnings, e.g. kekulization issues).
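A client usually wants two things from this response: the most promising corrected SMILES and a visual pointer to the offending character. The sketch below is one way to extract both from the documented fields. The helper names `best_fix` and `mark_position` are hypothetical, and treating `position` as a 0-based character index is an assumption based on the example above.

```python
from typing import Optional


def best_fix(response: dict) -> Optional[str]:
    """Return the highest-confidence corrected SMILES across all errors, or None."""
    candidates = [
        suggestion
        for error in response.get("errors", [])
        for suggestion in error.get("suggestions", [])
        if suggestion.get("corrected_smiles")
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: s.get("confidence", 0.0))
    return best["corrected_smiles"]


def mark_position(smiles: str, position: int) -> str:
    """Render the SMILES with a caret under the reported position (assumed 0-based)."""
    return f"{smiles}\n{' ' * position}^"


# Sample trimmed from the documented response.
sample = {
    "valid": False,
    "errors": [
        {
            "position": 0,
            "error_type": "valence_error",
            "suggestions": [
                {
                    "description": "Remove one neighbor branch",
                    "corrected_smiles": "C(C)(C)(C)C",
                    "confidence": 0.9,
                }
            ],
        }
    ],
}

print(best_fix(sample))
print(mark_position("C(C)(C)(C)(C)C", sample["errors"][0]["position"]))
```

Taking the maximum over `confidence` rather than trusting list order keeps the helper correct even if a server version returns suggestions unsorted.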
`POST /api/v1/diagnostics/inchi-diff` is a pure string comparison with no RDKit involvement. It parses each InChI into its constituent layers and produces a per-layer diff table.
```bash
curl -sS -X POST http://localhost:8000/api/v1/diagnostics/inchi-diff \
  -H 'Content-Type: application/json' \
  -d '{
    "inchi_a": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    "inchi_b": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/t7-/m1/s1"
  }'
```

Response:

```json
{
  "identical": false,
  "layer_rows": [
    {"layer": "formula", "value_a": "C9H8O4", "value_b": "C9H8O4", "match": true},
    {"layer": "connections", "value_a": "c1-6(...)", "value_b": "c1-6(...)", "match": true},
    {"layer": "hydrogens", "value_a": "h2-5H,1H3,(H,11,12)", "value_b": "h2-5H,1H3,(H,11,12)", "match": true},
    {"layer": "stereo_tetrahedral", "value_a": null, "value_b": "t7-", "match": false},
    {"layer": "stereo_parity", "value_a": null, "value_b": "m1", "match": false},
    {"layer": "stereo_marker", "value_a": null, "value_b": "s1", "match": false}
  ],
  "layers_a": {"formula": "C9H8O4", "connections": "...", "hydrogens": "..."},
  "layers_b": {"formula": "C9H8O4", "connections": "...", "hydrogens": "...", "stereo_tetrahedral": "t7-", ...}
}
```
Useful for answering "same compound, stereo-defined vs. racemic?" at a glance: rows with `match=false` isolate the exact disagreement.
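To make the layer-splitting idea concrete, here is a minimal sketch of a string-only InChI diff. It is not ChemAudit's implementation: the prefix-to-name mapping in `LAYER_NAMES` and the function names are assumptions chosen to mirror the example response, and a repeated prefix (as can occur in isotopic or fixed-H sections) would simply overwrite the earlier value.

```python
# Prefix letter -> layer name. The names mirror the example response;
# the mapping itself is an assumption, not ChemAudit's source code.
LAYER_NAMES = {
    "c": "connections",
    "h": "hydrogens",
    "t": "stereo_tetrahedral",
    "m": "stereo_parity",
    "s": "stereo_marker",
}


def split_layers(inchi: str) -> dict:
    """Split an InChI string into named layers using only string operations."""
    body = inchi.split("=", 1)[1]      # drop the leading "InChI"
    parts = body.split("/")            # parts[0] is the version, e.g. "1S"
    layers = {}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue                   # tolerate empty segments
        layers[LAYER_NAMES.get(part[0], part[0])] = part
    return layers


def diff_layers(inchi_a: str, inchi_b: str) -> list:
    """Build one row per layer seen in either InChI, flagging mismatches."""
    a, b = split_layers(inchi_a), split_layers(inchi_b)
    names = list(a) + [name for name in b if name not in a]
    return [
        {
            "layer": name,
            "value_a": a.get(name),
            "value_b": b.get(name),
            "match": a.get(name) == b.get(name),
        }
        for name in names
    ]


INCHI_A = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
INCHI_B = INCHI_A + "/t7-/m1/s1"

for row in diff_layers(INCHI_A, INCHI_B):
    if not row["match"]:
        print(row["layer"], row["value_a"], row["value_b"])
```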
`POST /api/v1/diagnostics/roundtrip` checks whether the molecule survives conversion to an intermediate format and back.
```bash
# SMILES → InChI → SMILES (detects stereo/isotope loss)
curl -sS -X POST http://localhost:8000/api/v1/diagnostics/roundtrip \
  -H 'Content-Type: application/json' \
  -d '{"smiles": "C[C@H](N)C(=O)O", "route": "smiles_inchi_smiles"}'

# SMILES → MOL block → SMILES (detects stereo/charge loss)
curl -sS -X POST http://localhost:8000/api/v1/diagnostics/roundtrip \
  -H 'Content-Type: application/json' \
  -d '{"smiles": "...", "route": "smiles_mol_smiles"}'
```

Response:

```json
{
  "route": "smiles_inchi_smiles",
  "original_smiles": "C[C@H](N)C(=O)O",
  "intermediate": "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
  "roundtrip_smiles": "C[C@H](N)C(=O)O",
  "lossy": false,
  "losses": [],
  "error": null
}
```

When lossy:

```json
{
  "lossy": true,
  "losses": [
    {"type": "stereo", "description": "2 stereocenters lost", "before": 2, "after": 0},
    {"type": "isotope", "description": "Isotope label stripped", "before": 1, "after": 0}
  ]
}
```

Loss types: `stereo`, `charge`, `isotope`.
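For scripted use, the request and the response handling can be wrapped in two small helpers. This is a sketch under stated assumptions: the function names and the `base_url` default are hypothetical, only the two routes shown in this document are accepted, and the request object is built but not sent.

```python
import json
from urllib.request import Request

# The two routes documented for this endpoint.
ROUTES = {"smiles_inchi_smiles", "smiles_mol_smiles"}


def build_roundtrip_request(smiles: str, route: str,
                            base_url: str = "http://localhost:8000") -> Request:
    """Build (but do not send) the POST request for the round-trip endpoint."""
    if route not in ROUTES:
        raise ValueError(f"unknown route: {route!r}")
    body = json.dumps({"smiles": smiles, "route": route}).encode("utf-8")
    return Request(
        f"{base_url}/api/v1/diagnostics/roundtrip",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_roundtrip(response: dict) -> str:
    """Summarise a round-trip response in one line."""
    if response.get("error"):
        return f"round trip failed: {response['error']}"
    if not response.get("lossy"):
        return "lossless"
    parts = [
        f"{loss['type']} ({loss['before']} -> {loss['after']})"
        for loss in response.get("losses", [])
    ]
    return "lossy: " + ", ".join(parts)


# Sample taken from the lossy response in this document.
lossy_sample = {
    "lossy": True,
    "losses": [
        {"type": "stereo", "description": "2 stereocenters lost", "before": 2, "after": 0},
        {"type": "isotope", "description": "Isotope label stripped", "before": 1, "after": 0},
    ],
}

request = build_roundtrip_request("C[C@H](N)C(=O)O", "smiles_inchi_smiles")
print(request.full_url)
print(describe_roundtrip(lossy_sample))
```

Passing the built request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.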
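`POST /api/v1/diagnostics/cross-pipeline` runs the molecule through three pipelines (RDKit MolStandardize, the ChEMBL pipeline, and a minimal pipeline) and compares their outputs. This is useful for picking which pipeline to use on a new dataset, or for diagnosing why standardized SMILES disagree between sources.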
```bash
curl -sS -X POST http://localhost:8000/api/v1/diagnostics/cross-pipeline \
  -H 'Content-Type: application/json' \
  -d '{"molecule": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]", "format": "auto"}'
```

Response:

```json
{
  "pipelines": [
    {
      "name": "RDKit MolStandardize",
      "smiles": "CC(=O)Oc1ccccc1C(=O)O",
      "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
      "mw": 180.157,
      "formula": "C9H8O4",
      "charge": 0,
      "stereo_count": 0,
      "error": null,
      "highlight_atoms": [],
      "highlight_bonds": []
    },
    {"name": "ChEMBL Pipeline", ...},
    {"name": "Minimal", "smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]", ...}
  ],
  "disagreements": 3,
  "structural_disagreements": 1,
  "all_agree": false,
  "property_comparison": [
    {"property": "smiles", "values": [...], "agrees": false, "structural": true},
    {"property": "inchikey", "values": [...], "agrees": false, "structural": true},
    {"property": "mw", "values": [180.157, 180.157, 202.14], "agrees": false, "structural": false},
    ...
  ]
}
```
`highlight_atoms` / `highlight_bonds` mark the atoms and bonds that fall outside the cross-pipeline maximum common substructure (MCS), which is useful for rendering a side-by-side visual diff in the UI.
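When triaging many molecules, it helps to separate structural disagreements from merely derived ones such as molecular weight. The sketch below filters `property_comparison` on the documented `agrees` and `structural` flags. The helper name is hypothetical, and the sample omits the `values` arrays that this document elides.

```python
def split_disagreements(response: dict) -> tuple:
    """Return (structural, non_structural) lists of property names that disagree."""
    structural, other = [], []
    for row in response.get("property_comparison", []):
        if row.get("agrees"):
            continue                     # all pipelines agree on this property
        target = structural if row.get("structural") else other
        target.append(row["property"])
    return structural, other


# Sample reduced to the flags shown in the documented response.
sample = {
    "all_agree": False,
    "property_comparison": [
        {"property": "smiles", "agrees": False, "structural": True},
        {"property": "inchikey", "agrees": False, "structural": True},
        {"property": "mw", "agrees": False, "structural": False},
    ],
}

structural, other = split_disagreements(sample)
print("structural:", structural)
print("non-structural:", other)
```

A non-empty structural list is the signal that the pipelines produced different molecules, not just different property values.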
`POST /api/v1/diagnostics/file-prevalidate` catches file-level problems before you pay the cost of a batch upload.
```bash
curl -sS -X POST http://localhost:8000/api/v1/diagnostics/file-prevalidate \
  -F "file=@compounds.sdf"
```

Response (SDF):

```json
{
  "file_type": "sdf",
  "total_blocks": 1000,
  "total_rows": null,
  "encoding": null,
  "issue_count": 3,
  "issues": [
    {"block": 42, "line": 1250, "issue_type": "missing_m_end", "severity": "error",
     "description": "Block 42 is missing the 'M END' terminator"},
    {"block": 87, "line": 2310, "issue_type": "malformed_count_line", "severity": "error",
     "description": "Block 87 has a malformed counts line"},
    {"block": null, "line": null, "issue_type": "suspicious_content", "severity": "warning",
     "description": "Found pattern that could indicate embedded code"}
  ],
  "valid": false
}
```

Response (CSV):

```json
{
  "file_type": "csv",
  "total_blocks": null,
  "total_rows": 500,
  "encoding": "utf-8",
  "issue_count": 1,
  "issues": [
    {"block": null, "line": null, "issue_type": "missing_smiles_column", "severity": "error",
     "description": "No column matches 'SMILES' (case-insensitive)"}
  ],
  "valid": false
}
```
Issue types: `missing_m_end`, `malformed_count_line`, `encoding_fallback`, `encoding_error`, `suspicious_content`, `missing_smiles_column`, `empty_rows`, `empty_file`, `duplicate_columns`.

Severity levels: `error`, `warning`, `info`. `valid` is `false` only when at least one issue has `error` severity; warnings alone do not affect validity.

Max upload size: 50 MB (hard-coded in this endpoint).
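The same kind of check can be approximated locally before contacting the server. The sketch below covers only the `missing_m_end` rule and the severity-to-validity rule described above; it is not the endpoint's implementation. The function names are hypothetical, 1-based block numbering is an assumption, and the terminator is matched as `M  END` (two spaces), as written in the MOL format.

```python
def blocks_missing_m_end(sdf_text: str) -> list:
    """Return block numbers (assumed 1-based) of SDF records lacking an 'M  END' line."""
    missing = []
    for number, record in enumerate(sdf_text.split("$$$$"), start=1):
        if not record.strip():
            continue                      # whitespace after the final $$$$
        has_terminator = any(
            line.strip() == "M  END" for line in record.splitlines()
        )
        if not has_terminator:
            missing.append(number)
    return missing


def is_valid(issues: list) -> bool:
    """A file is valid unless at least one issue has severity 'error'."""
    return not any(issue.get("severity") == "error" for issue in issues)


def prevalidate_sdf(sdf_text: str) -> dict:
    """Produce a reduced report in the shape of the documented response."""
    issues = [
        {"block": number, "issue_type": "missing_m_end", "severity": "error"}
        for number in blocks_missing_m_end(sdf_text)
    ]
    return {"issue_count": len(issues), "issues": issues, "valid": is_valid(issues)}


# Two toy records: the first is terminated, the second is not.
good_record = "mol1\n\n\n  0  0\nM  END\n$$$$\n"
bad_record = "mol2\n\n\n  0  0\n$$$$\n"

report = prevalidate_sdf(good_record + bad_record)
print(report)
```

A local pass like this only saves a round trip for the most common defect; the server-side check remains the authority on everything else.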
To diagnose a SMILES that fails to parse:

1. `POST /diagnostics/smiles` with the bad SMILES.
2. Read `errors[0].error_type` and `message`: they tell you what went wrong.
3. Read `errors[0].position`: it tells you where in the string.
4. Try `errors[0].suggestions[0].corrected_smiles`: the first suggestion has the highest confidence.

To find out how two InChI strings differ:

1. `POST /diagnostics/inchi-diff` with both.
2. Check `identical`. If `true`, you're done.
3. If `false`, scan `layer_rows[]` for rows with `match=false`. Formula differences mean different compounds; differences only in the stereo or isotope layers mean "same skeleton, different specification".

To check what will be lost when saving as a MOL block:

1. `POST /diagnostics/roundtrip` with `route="smiles_mol_smiles"`.
2. `lossy=true` means something will be lost when saving. The `losses[]` array tells you what.

To check a file before batch upload:

1. `POST /diagnostics/file-prevalidate` with the file.
2. If `valid=true`, go straight to `/batch/upload`.
3. If `valid=false`, fix each `severity=error` issue; warnings can be ignored at your discretion. Common fix: append `M  END` to blocks missing it, or re-save from RDKit.

Rate limits:

- `/diagnostics/smiles`, `/diagnostics/inchi-diff`, `/diagnostics/roundtrip`: 30/min.
- `/diagnostics/cross-pipeline`, `/diagnostics/file-prevalidate`: 10/min.

The validator blocks the characters `<`, `>`, `&`, `;`, `|`, `$`, and the backtick. These essentially never occur in molecule SMILES (`>` belongs to reaction SMILES, and `$` is the rarely used quadruple-bond symbol), so strip them client-side.
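Stripping the blocked characters client-side takes only a few lines. The sketch below removes them and also reports what was removed, so that a caller can warn the user instead of silently changing the input. The function name and return shape are hypothetical; the character set is the one listed above.

```python
# Characters the validator is documented to block, including the backtick.
BLOCKED_CHARS = frozenset("<>&;|$`")


def strip_blocked(text: str) -> tuple:
    """Return (cleaned_text, removed_chars) with blocked characters taken out."""
    removed = [char for char in text if char in BLOCKED_CHARS]
    cleaned = "".join(char for char in text if char not in BLOCKED_CHARS)
    return cleaned, removed


cleaned, removed = strip_blocked("CC(=O)O;`ls`")
print(cleaned)
print(removed)
```

Returning the removed characters matters because stripping changes the input: a caller that sees a non-empty list can decide whether to submit the cleaned string or reject the original.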
`errors[]` is empty but `valid=false`: rare but possible when the parse error occurs in the RDKit C++ layer with no accompanying log message. Try `/standardize` to see if the ChEMBL pipeline's error message is more informative.
When a round trip is reported as lossy, check `losses[].type`:

- `stereo`: the intermediate format doesn't encode your stereo (e.g. atropisomerism).
- `isotope`: InChI's optional isotope layer was dropped in the round trip (shouldn't happen with standard InChI but occasionally does on edge cases).
- `charge`: MOL v2000 loses some charge states.

When the pipelines in `/diagnostics/cross-pipeline` disagree, it usually means the input is unusual (organometallic, radical, mixture). Treat each pipeline's output as "one opinion of many" and pick the one matching the source database's convention (ChEMBL for ChEMBL-derived data, RDKit for exotic cases).
`encoding_fallback`: the file wasn't valid UTF-8; the validator fell back to Latin-1 and succeeded. This is usually fine but worth noting; re-save as UTF-8 if you'll reuse the file.
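The UTF-8-then-Latin-1 fallback is easy to reproduce locally, which also gives a way to re-save the file as UTF-8. This is a minimal sketch of the general technique, not the validator's code, and the function name and return shape are hypothetical.

```python
def decode_with_fallback(raw: bytes) -> tuple:
    """Decode as UTF-8, falling back to Latin-1; return (text, encoding, fell_back)."""
    try:
        return raw.decode("utf-8"), "utf-8", False
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this branch cannot fail.
        return raw.decode("latin-1"), "latin-1", True


latin1_bytes = "café,CCO\n".encode("latin-1")
text, encoding, fell_back = decode_with_fallback(latin1_bytes)
print(encoding, fell_back)

# Re-encode so that later reads no longer need the fallback.
utf8_bytes = text.encode("utf-8")
print(decode_with_fallback(utf8_bytes)[1])
```

Because Latin-1 accepts any byte sequence, a successful fallback only proves the file is readable, not that the characters are the ones the author intended, which is why re-saving as UTF-8 after a visual check is the safer habit.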
`suspicious_content`: the file content matched a script/macro pattern. False positives happen on legitimate SMILES containing `<`, `>`, or similar characters (rare). The match is audit-logged server-side regardless.
Related skills:

- `chemaudit-molecule-validation`: for check-level issues once a SMILES parses.
- `chemaudit-standardization`: for the ChEMBL-style standardization pipeline.
- `chemaudit-batch-validation`: pair `/file-prevalidate` with `/batch/upload`.