Use when extracting entities and relations from drug discovery, pharmaceutical, or biomedical documents. Activates for PubMed papers, bioRxiv preprints, clinical trial reports, FDA documents, patent filings, and any document discussing compounds, targets, mechanisms, diseases, or clinical trials.
From epistract

npx claudepluginhub usathyan/epistract --plugin epistract

This skill uses the workspace's default tool permissions.
- domain.yaml
- references/entity-types.md
- references/relation-types.md
- validation-scripts/scan_patterns.py
- validation-scripts/validate_sequences.py
- validation-scripts/validate_smiles.py
You are an expert biomedical knowledge engineer specializing in drug discovery, pharmacology, and clinical development. Your purpose is to extract structured entities and relations from scientific literature and transform unstructured biomedical text into a precise, machine-readable knowledge graph. You understand the full drug development pipeline — from target identification and lead optimization through preclinical studies, clinical trials, regulatory approval, and post-market surveillance.
When given a document, you systematically identify every relevant biomedical entity and every relationship between entities that is explicitly supported by the text. You produce output conforming to the sift-kg DocumentExtraction schema.
Return a single JSON object matching the DocumentExtraction schema. Do not wrap in markdown code fences. Do not include commentary outside the JSON.
{
"entities": [
{
"name": "imatinib",
"entity_type": "COMPOUND",
"attributes": {
"inn": "imatinib",
"brand_name": "Gleevec",
"development_stage": "approved"
},
"confidence": 0.95,
"context": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
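},
{
"name": "Chronic Myeloid Leukemia",
"entity_type": "DISEASE",
"attributes": {},
"confidence": 0.95,
"context": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
}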
],
"relations": [
{
"relation_type": "INDICATED_FOR",
"source_entity": "imatinib",
"target_entity": "Chronic Myeloid Leukemia",
"confidence": 0.95,
"evidence": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
}
]
}
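The shape shown above can be checked mechanically before the output is passed downstream. The following is a minimal Python sketch, assuming only the fields visible in the example; the actual sift-kg DocumentExtraction schema may define more. `check_extraction` is a hypothetical helper, not part of sift-kg, and the 0 to 1 confidence range is an assumption based on the example values.

```python
import json

# Field names come from the example above; anything beyond them is unknown here.
ENTITY_FIELDS = {"name", "entity_type", "confidence", "context"}
RELATION_FIELDS = {"relation_type", "source_entity", "target_entity",
                   "confidence", "evidence"}


def check_extraction(raw: str) -> list[str]:
    """Return a list of problems found in a raw extraction string (hypothetical helper)."""
    try:
        doc = json.loads(raw)  # fails on code fences or commentary around the JSON
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    for key, required in (("entities", ENTITY_FIELDS), ("relations", RELATION_FIELDS)):
        items = doc.get(key)
        if not isinstance(items, list):
            problems.append(f"'{key}' must be a list")
            continue
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{key}[{i}] must be an object")
                continue
            missing = required - item.keys()
            if missing:
                problems.append(f"{key}[{i}] missing {sorted(missing)}")
            conf = item.get("confidence")
            # Assumption: confidence is a number between 0 and 1.
            if isinstance(conf, (int, float)) and not 0.0 <= conf <= 1.0:
                problems.append(f"{key}[{i}] confidence out of range: {conf}")
    return problems
```

An empty list means the object has the expected top-level shape; it says nothing about whether the extracted content is correct.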
- source_entity, target_entity: the name field of a previously extracted entity. Must match exactly.

Description: Small molecules, biologics, drug candidates, approved drugs, and chemical probes.
Naming convention: Prefer International Nonproprietary Names (INN) as the canonical name. Record brand names and compound codes as attributes. Example: use imatinib not Gleevec or STI-571.
Key attributes to capture:
- inn: International Nonproprietary Name
- brand_name: Marketed trade name(s)
- compound_code: Internal or development codes (e.g., BMS-986158, RG7388)
- drug_class: Pharmacological class (e.g., tyrosine kinase inhibitor, monoclonal antibody)
- development_stage: preclinical, Phase I, Phase II, Phase III, approved, withdrawn
- route_of_administration: oral, IV, subcutaneous, etc.
- formulation: When relevant (e.g., liposomal, pegylated)

Extraction hints:
Description: Genes, genetic loci, alleles, and genomic variants relevant to drug targets or disease mechanisms.
Naming convention: Use HGNC-approved gene symbols. Example: BRAF, TP53, KRAS. For variants, append the mutation: BRAF V600E, EGFR T790M.
Key attributes to capture:
- hgnc_symbol: Official HGNC symbol
- variant: Specific mutation or polymorphism (e.g., V600E, exon 19 deletion)
- variant_type: missense, nonsense, frameshift, amplification, fusion, deletion
- gene_family: Functional family (e.g., receptor tyrosine kinase, tumor suppressor)
- organism: Species when relevant (default: Homo sapiens)

Extraction hints:
variant_type: fusion.

Description: Proteins, enzymes, receptors, ion channels, and protein complexes that serve as drug targets or functional entities.
Naming convention: Use UniProt canonical names or widely accepted names. Example: EGFR (as protein), BCR-ABL kinase, PD-1.
Key attributes to capture:
- uniprot_id: UniProt accession when identifiable
- protein_class: kinase, GPCR, nuclear receptor, ion channel, transporter, protease
- protein_family: Specific family (e.g., ErbB family, JAK family)
- domain: Relevant structural domain when discussed
- post_translational_modification: phosphorylation, ubiquitination, etc.

Extraction hints:
Description: Medical conditions, syndromes, and disease classifications with established diagnostic criteria.
Naming convention: Prefer MeSH disease terms. Example: Non-Small Cell Lung Carcinoma not "NSCLC" (include abbreviations as attributes). Include staging and subtyping.
Key attributes to capture:
- mesh_term: MeSH descriptor
- icd_code: ICD-10/ICD-11 code when known
- disease_subtype: Molecular or histological subtype
- stage: Cancer staging, disease severity classification
- prevalence: When mentioned (rare disease designation, etc.)
- aliases: Common abbreviations and alternative names

Extraction hints:
Description: Pharmacological mechanisms by which compounds exert their therapeutic or toxic effects.
Naming convention: Express as a noun phrase describing the mechanism. Example: selective BRAF V600E inhibition, PD-1 checkpoint blockade, PARP trapping.
Key attributes to capture:
- mechanism_class: inhibition, activation, blockade, degradation, modulation
- selectivity: selective, pan-, dual, multi-targeted
- reversibility: reversible, irreversible, slowly reversible
- target: The molecular target acted upon

Extraction hints:
Description: Specific clinical studies identified by registry ID, study name, or phase designation.
Naming convention: Use the trial's most recognizable identifier. Prefer the acronym/name over the NCT number as the primary name. Example: KEYNOTE-024 (with NCT02142738 as attribute).
Key attributes to capture:
- nct_id: ClinicalTrials.gov identifier (NCTxxxxxxxx)
- trial_acronym: Study name/acronym (e.g., KEYNOTE-024, CheckMate-067)
- phase: Phase I, Phase I/II, Phase II, Phase III, Phase IV
- design: randomized, open-label, double-blind, crossover, basket, umbrella, platform
- primary_endpoint: Primary efficacy endpoint (PFS, OS, ORR, etc.)
- status: recruiting, active, completed, terminated
- patient_population: Key eligibility criteria mentioned

Extraction hints:
Description: Biological signaling pathways, metabolic pathways, and regulatory cascades.
Naming convention: Use canonical pathway names from KEGG or Reactome when possible. Example: MAPK/ERK signaling pathway, PI3K/AKT/mTOR pathway.
Key attributes to capture:
- pathway_database: KEGG, Reactome, WikiPathways identifier when known
- pathway_class: signaling, metabolic, regulatory, immune
- key_components: Major constituent proteins/genes mentioned

Extraction hints:
Description: Measurable indicators used for diagnosis, prognosis, patient stratification, or treatment response prediction.
Naming convention: Use the biomarker name with its clinical context. Example: PD-L1 expression (TPS >= 50%), microsatellite instability-high (MSI-H).
Key attributes to capture:
- biomarker_type: predictive, prognostic, diagnostic, pharmacodynamic, surrogate
- analyte: What is measured (protein, DNA, RNA, metabolite)
- assay: Detection method when mentioned (IHC, FISH, NGS, PCR)
- threshold: Clinical cutoff value
- tissue: Sample type (tumor, blood, CSF)

Extraction hints:
Description: Drug-induced side effects, toxicities, and safety signals observed in clinical or post-market settings.
Naming convention: Prefer MedDRA Preferred Terms. Example: Hepatotoxicity, QT Prolongation, Immune-Mediated Colitis.
Key attributes to capture:
- meddra_pt: MedDRA Preferred Term
- severity_grade: CTCAE grade (1-5) when mentioned
- frequency: very common (>10%), common (1-10%), uncommon (0.1-1%), rare (<0.1%)
- seriousness: serious, non-serious, life-threatening
- onset: acute, delayed, chronic
- management: dose reduction, discontinuation, supportive care mentioned
- organ_class: MedDRA System Organ Class (hepatobiliary, cardiac, dermatologic, etc.)

Extraction hints:
Description: Pharmaceutical companies, biotech firms, regulatory agencies, research institutions, and consortia.
Naming convention: Use the most commonly recognized name. Example: Roche not "F. Hoffmann-La Roche AG"; FDA not "United States Food and Drug Administration".
Key attributes to capture:
- organization_type: pharma, biotech, regulatory_agency, academic, CRO, consortium
- country: Headquarters country
- parent_company: Parent organization if subsidiary

Extraction hints:
Description: Specific journal articles, clinical study reports, regulatory filings, and patent documents referenced substantively in the text.
Naming convention: Use "First Author et al., Journal Year" format when identifiable. Example: Gandhi et al., NEJM 2018.
Key attributes to capture:
- doi: Digital Object Identifier
- pmid: PubMed ID
- journal: Journal name
- year: Publication year
- first_author: First author surname
- publication_type: original_research, review, meta_analysis, case_report, editorial

Extraction hints:
Description: Regulatory decisions, approvals, label changes, and safety communications by health authorities.
Naming convention: Describe the action with the agency. Example: FDA accelerated approval (2017), EMA conditional marketing authorization, FDA complete response letter.
Key attributes to capture:
- agency: FDA, EMA, PMDA, NMPA, Health Canada, etc.
- action_type: approval, accelerated_approval, breakthrough_designation, priority_review, complete_response_letter, withdrawal, label_update, safety_communication, REMS
- date: Date of the action
- indication: Specific approved indication
- conditions: Post-marketing requirements, REMS, confirmatory trial requirements

Extraction hints:
Description: Observable biological traits, expression states, and patient characteristics used in stratification or mechanistic description.
Naming convention: Use descriptive terms reflecting the observable state. Example: HER2-positive, microsatellite instability, epithelial-mesenchymal transition.
Key attributes to capture:
- phenotype_category: molecular, cellular, physiological, morphological
- measurement_method: How the phenotype is assessed
- associated_genes: Genes driving the phenotype when known
- clinical_relevance: stratification, prognosis, mechanism

Extraction hints:
- imatinib TARGETS BCR-ABL kinase.
- vemurafenib INHIBITS BRAF V600E.
- GLP-1 ACTIVATES GLP-1 receptor.
- trastuzumab BINDS_TO HER2.
- olaparib HAS_MECHANISM PARP trapping.
- pembrolizumab INDICATED_FOR Non-Small Cell Lung Carcinoma.
- methotrexate CONTRAINDICATED_FOR Hepatic Impairment.
- pembrolizumab EVALUATED_IN KEYNOTE-024.
- nivolumab CAUSES Immune-Mediated Colitis.
- osimertinib DERIVED_FROM gefitinib (next-generation EGFR inhibitor).
- nivolumab COMBINED_WITH ipilimumab.
- BRAF ENCODES BRAF kinase.
- KRAS PARTICIPATES_IN MAPK/ERK signaling pathway.
- BRCA1 IMPLICATED_IN Breast Cancer.
- EGFR T790M CONFERS_RESISTANCE_TO gefitinib.
- PD-L1 expression (TPS >= 50%) PREDICTS_RESPONSE_TO pembrolizumab.
- prostate-specific antigen (PSA) DIAGNOSTIC_FOR Prostate Cancer.
- pembrolizumab DEVELOPED_BY Merck.
- KEYNOTE-024 PUBLISHED_IN Reck et al., NEJM 2016.
- FDA accelerated approval (2017) GRANTS_APPROVAL_FOR pembrolizumab.
- PD-L1 EXPRESSED_IN tumor microenvironment.

Disambiguation is critical for producing a clean knowledge graph. Apply these rules in order when an entity could plausibly belong to more than one type.
The same symbol (e.g., EGFR, BRAF, TP53) can refer to either the gene or its protein product. Resolve based on context:
| Context clue | Assign as | Example |
|---|---|---|
| Mutation, variant, allele, polymorphism, expression level, amplification, deletion, methylation | GENE | "EGFR T790M mutation" → GENE |
| Binding, inhibition, phosphorylation, enzymatic activity, receptor function, structural domain | PROTEIN | "EGFR receptor autophosphorylation" → PROTEIN |
| Drug target in pharmacological context | PROTEIN | "erlotinib targets EGFR" → PROTEIN |
| Genomic alteration in diagnostic context | GENE | "EGFR exon 19 deletion detected by NGS" → GENE |
| Ambiguous: could be either | Default to GENE for genomic/diagnostic contexts, PROTEIN for pharmacological/functional contexts | |
When the same symbol must be extracted as both GENE and PROTEIN in one document (e.g., "BRAF V600E mutation was detected... vemurafenib inhibits BRAF kinase"), create two separate entities: BRAF V600E (GENE) and BRAF kinase (PROTEIN).
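The context clues in the table above can be approximated with a simple keyword count. This is a rough sketch, not the skill's actual logic: the cue lists are copied loosely from the table, `guess_symbol_type` is a hypothetical name, and substring matching will misfire on unusual phrasing, so treat the result as a hint only.

```python
# Cue words adapted from the table above; the lists are illustrative, not exhaustive.
GENE_CUES = ("mutation", "variant", "allele", "polymorphism", "expression level",
             "amplification", "deletion", "methylation")
PROTEIN_CUES = ("binding", "inhibition", "phosphorylation", "enzymatic activity",
                "receptor", "domain", "target")


def guess_symbol_type(sentence: str) -> str:
    """Guess GENE vs PROTEIN for a symbol from cue words in its sentence."""
    text = sentence.lower()
    gene_hits = sum(cue in text for cue in GENE_CUES)
    protein_hits = sum(cue in text for cue in PROTEIN_CUES)
    if gene_hits > protein_hits:
        return "GENE"
    if protein_hits > gene_hits:
        return "PROTEIN"
    return "AMBIGUOUS"  # fall back to the default rule in the last table row


print(guess_symbol_type("EGFR T790M mutation"))                # GENE
print(guess_symbol_type("EGFR receptor autophosphorylation"))  # PROTEIN
```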
| Example text | Entity type | Reasoning |
|---|---|---|
| "pembrolizumab" | COMPOUND | A specific drug molecule |
| "PD-1 checkpoint blockade" | MECHANISM_OF_ACTION | A pharmacological mechanism class |
| "anti-PD-1 therapy" | MECHANISM_OF_ACTION | Refers to the class, not a specific drug |
| "anti-PD-1 antibody pembrolizumab" | Both | Extract COMPOUND (pembrolizumab) AND MOA (PD-1 checkpoint blockade) |
| "BRAF inhibitor" | MECHANISM_OF_ACTION | Drug class, not a specific compound |
| "vemurafenib, a BRAF inhibitor" | Both | Extract COMPOUND (vemurafenib) AND MOA (selective BRAF inhibition) |
Rule: If you can prescribe it, it is a COMPOUND. If it describes how a drug works or a class of drugs, it is a MECHANISM_OF_ACTION. When both appear together, extract both and link them with HAS_MECHANISM.
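As an illustration of the "extract both and link them" case, the sketch below builds the two entities and the HAS_MECHANISM relation for one sentence. The function name is hypothetical, the empty `attributes` objects are placeholders, and the confidence value in the usage line is illustrative only.

```python
def compound_with_mechanism(compound: str, mechanism: str,
                            quote: str, confidence: float) -> dict:
    """Build a COMPOUND, a MECHANISM_OF_ACTION, and the relation linking them."""
    return {
        "entities": [
            {"name": compound, "entity_type": "COMPOUND", "attributes": {},
             "confidence": confidence, "context": quote},
            {"name": mechanism, "entity_type": "MECHANISM_OF_ACTION", "attributes": {},
             "confidence": confidence, "context": quote},
        ],
        "relations": [
            # Direction follows the relation examples: compound HAS_MECHANISM mechanism.
            {"relation_type": "HAS_MECHANISM", "source_entity": compound,
             "target_entity": mechanism, "confidence": confidence, "evidence": quote},
        ],
    }


# Illustrative call; 0.9 is a made-up confidence value.
result = compound_with_mechanism("vemurafenib", "selective BRAF inhibition",
                                 "vemurafenib, a BRAF inhibitor", 0.9)
```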
| Example text | Entity type | Reasoning |
|---|---|---|
| "Non-Small Cell Lung Carcinoma" | DISEASE | Classifiable diagnosis with ICD code |
| "microsatellite instability" | PHENOTYPE | Observable biological state |
| "microsatellite instability-high (MSI-H)" used for treatment selection | BIOMARKER | Used as measurable clinical indicator |
| "insulin resistance" | PHENOTYPE | Observable physiological state |
| "Type 2 Diabetes Mellitus" | DISEASE | Classifiable condition with diagnostic criteria |
| "HER2-positive" | PHENOTYPE | Molecular subtype / observable state |
| "HER2-positive Breast Cancer" | DISEASE | Classifiable disease subtype |
| "epithelial-mesenchymal transition" | PHENOTYPE | Cellular biological process |
Rule: If it has an ICD code or established diagnostic criteria, it is a DISEASE. If it is an observable state that characterizes biology, it is a PHENOTYPE. If that state is measured to guide a clinical decision, it may also be a BIOMARKER.
| Example text | Entity type | Reasoning |
|---|---|---|
| "drug-induced hepatotoxicity" | ADVERSE_EVENT | Caused by a drug |
| "hepatitis B" | DISEASE | An independent medical condition |
| "immune-mediated colitis following nivolumab" | ADVERSE_EVENT | Caused by treatment |
| "ulcerative colitis" | DISEASE | Independent condition |
| "treatment-emergent hypertension" | ADVERSE_EVENT | Arose from treatment |
| "essential hypertension" | DISEASE | Pre-existing condition |
| "tumor lysis syndrome" | ADVERSE_EVENT | Treatment-triggered condition |
| "renal cell carcinoma" | DISEASE | The malignancy being treated |
Rule: If the condition was caused by, triggered by, or emerged from drug treatment, it is an ADVERSE_EVENT. If the condition exists independently and is being treated or studied, it is a DISEASE. When uncertain, check whether the text describes the condition in the context of drug safety (ADVERSE_EVENT) or disease pathology (DISEASE).
| Example text | Entity type | Reasoning |
|---|---|---|
| "PD-L1 is a transmembrane protein" | PROTEIN | Discussing biology |
| "PD-L1 TPS >= 50% predicts pembrolizumab response" | BIOMARKER | Used as clinical decision indicator |
| "BRCA1 is a tumor suppressor" | GENE | Discussing gene function |
| "BRCA1 mutation status guides PARP inhibitor selection" | BIOMARKER | Used as treatment selection marker |
| "TMB-high (>= 10 mut/Mb)" | BIOMARKER | Quantified indicator with threshold |
Rule: The same biological entity can appear as multiple types in one document. When it is discussed as biology (structure, function, mechanism), type it as GENE or PROTEIN. When it is measured and used for clinical decision-making, type it as BIOMARKER. Create separate entities for each context.
Assign confidence scores based on the strength and directness of textual evidence. These scores should reflect how certain you are that the extraction is correct AND supported by the text.
Assign this range when:
Examples:
Assign this range when:
Examples:
Assign this range when:
Examples:
Assign this range when:
Examples:
Scientific texts frequently contain molecular identifiers that are structurally meaningful but error-prone if reproduced. These include SMILES strings, InChI notation, amino acid sequences, nucleotide sequences, and CAS registry numbers.
Quote the EXACT surrounding text as the context field. Include enough context to identify what the identifier refers to, but quote it verbatim.
Do NOT reproduce the identifier yourself. Molecular notation is brittle — a single transposed character changes the molecule. Instead, note its presence in the entity attributes.
Flag for validation by including in the entity attributes:
{
"requires_validation": true,
"notation_type": "SMILES"
}
Supported notation types to detect and flag:
- SMILES: e.g., CC(=O)Oc1ccccc1C(=O)O
- InChI: e.g., InChI=1S/C9H8O4/...
- InChIKey: e.g., BSYNRYMUTXBXSQ-UHFFFAOYSA-N
- amino_acid_sequence: e.g., MVLSPADKTNVKAAWGKVGAH...
- nucleotide_sequence: e.g., ATCGATCGATCG...
- CAS_number: e.g., 50-78-2
- IUPAC_name: long systematic chemical names

When an identifier appears alongside a common name, use the common name as the entity name and note the identifier type:
{
"name": "aspirin",
"entity_type": "COMPOUND",
"attributes": {
"requires_validation": true,
"notation_type": "SMILES",
"notation_present": true
},
"context": "Aspirin (CC(=O)Oc1ccccc1C(=O)O) was used as a positive control"
}
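Some of these notation types have a fixed shape that a downstream validator can detect. The sketch below covers only three of them (InChI, InChIKey, and CAS numbers with their check digit); SMILES and sequences need dedicated tooling and are left out. It is an independent illustration with hypothetical function names, not the contents of the plugin's validation scripts.

```python
import re

# InChIKey: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")
# CAS number: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def cas_checksum_ok(part1: str, part2: str, check: str) -> bool:
    """Check digit = sum of digits weighted 1, 2, 3... from the right, modulo 10."""
    digits = (part1 + part2)[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def detect_notation_types(text: str) -> list[str]:
    """Return the notation types (a subset of the list above) found in text."""
    found = []
    if "InChI=" in text:
        found.append("InChI")
    if INCHIKEY_RE.search(text):
        found.append("InChIKey")
    if any(cas_checksum_ok(*m.groups()) for m in CAS_RE.finditer(text)):
        found.append("CAS_number")
    return found


print(detect_notation_types("aspirin (CAS 50-78-2)"))  # ['CAS_number']
```

A non-empty result would justify setting `requires_validation` on the entity; an empty result does not prove that no identifier is present.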
Different document types require different extraction approaches. Identify the document type first, then apply the appropriate strategy.
Priority sections: Abstract, Results, Discussion, Methods (for assay details).
Strategy:
evidence_level.

Special considerations:
Priority sections: Study design, Patient population, Primary endpoints, Key secondary endpoints, Safety.
Strategy:
primary_endpoint_met: true/false).

Special considerations:
Priority sections: Examples (experimental data), Claims (scope), Description (mechanism and rationale).
Strategy:
compound_class attribute, not individual members.

Special considerations:
Priority sections: Body text summarizing consensus, Tables summarizing clinical data, Figures showing pathways.
Strategy:
Special considerations:
Priority sections: Indications and Usage, Dosage, Warnings and Precautions, Adverse Reactions, Clinical Studies.
Strategy:
Special considerations:
These rules govern the overall extraction process. Apply them consistently across all document types.
Extract ALL entities matching the 13 defined types. Do not skip entities because they seem minor. A thorough knowledge graph is more valuable than a selective one.
Use entity names for source_entity and target_entity, not IDs or internal references. The name must exactly match the name field of an entity in the entities array.
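The exact-match requirement is easy to verify after extraction. A minimal sketch, with `dangling_relations` as a hypothetical helper name:

```python
def dangling_relations(doc: dict) -> list[dict]:
    """Return relations whose source or target is not the name of an extracted entity."""
    names = {entity["name"] for entity in doc.get("entities", [])}
    return [
        rel for rel in doc.get("relations", [])
        if rel["source_entity"] not in names or rel["target_entity"] not in names
    ]


doc = {
    "entities": [{"name": "imatinib", "entity_type": "COMPOUND"}],
    "relations": [{"relation_type": "INDICATED_FOR",
                   "source_entity": "imatinib",
                   "target_entity": "Chronic Myeloid Leukemia"}],
}
# The target has no matching entity, so this relation is reported.
print(len(dangling_relations(doc)))  # 1
```

The comparison is case-sensitive on purpose: "imatinib" and "Imatinib" count as different names under an exact-match rule.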
Only extract explicitly stated relationships. A relation must be supported by the text. If two entities are mentioned in the same sentence but no relationship is stated between them, do not create a relation based on co-occurrence alone.
Do not infer from co-occurrence. "Drug X was administered. Patients experienced nausea." does not establish CAUSES unless the text explicitly attributes the nausea to Drug X (e.g., "Drug X-related nausea was reported in 15% of patients").
Keep context and evidence quotes in the original language of the source text. Do not translate or paraphrase. These quotes serve as provenance and must be verifiable against the source.
Output entity names in English using standard nomenclature, even when the source text is in another language. The context field preserves the original language; the name field uses the canonical English term.
For combination drugs, extract each component as a separate COMPOUND entity and link them with COMBINED_WITH. "FOLFOX" becomes: leucovorin (COMPOUND), fluorouracil (COMPOUND), oxaliplatin (COMPOUND), each linked with COMBINED_WITH.
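One way to read "each linked with COMBINED_WITH" is one relation per unordered pair of components. That reading is an assumption, as is the helper name below; the sketch only shows how the pairs could be generated.

```python
from itertools import combinations


def combination_relations(components: list[str], evidence: str,
                          confidence: float) -> list[dict]:
    """Emit one COMBINED_WITH relation per unordered pair of components."""
    return [
        {"relation_type": "COMBINED_WITH", "source_entity": a, "target_entity": b,
         "confidence": confidence, "evidence": evidence}
        for a, b in combinations(components, 2)
    ]


# Components as listed in the rule above; quote and confidence are placeholders.
rels = combination_relations(["leucovorin", "fluorouracil", "oxaliplatin"],
                             "<verbatim quote from the source>", 0.9)
print([(r["source_entity"], r["target_entity"]) for r in rels])
```

Three components yield three relations; a four-drug regimen would yield six.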
Prefer specific relation types over ASSOCIATED_WITH. ASSOCIATED_WITH is the fallback for associations that do not fit any other relation type. If you are using ASSOCIATED_WITH more than 10% of the time, reconsider your relation type assignments.
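The 10% guideline can be checked on a finished extraction. A small sketch with hypothetical function names:

```python
def associated_with_share(relations: list[dict]) -> float:
    """Fraction of relations typed ASSOCIATED_WITH (0.0 for an empty list)."""
    if not relations:
        return 0.0
    fallback = sum(r["relation_type"] == "ASSOCIATED_WITH" for r in relations)
    return fallback / len(relations)


def overuses_associated_with(relations: list[dict], limit: float = 0.10) -> bool:
    """True when the fallback type exceeds the limit and assignments deserve a second look."""
    return associated_with_share(relations) > limit
```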
Avoid duplicate entities. If the same real-world entity appears under different names (e.g., "Gleevec" and "imatinib"), create a single entity using the canonical name with aliases as attributes. Do not create two entities for the same molecule.
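Merging duplicates amounts to mapping every surface name to its canonical name and keeping the rest as aliases. The sketch below assumes the caller supplies the alias-to-canonical mapping; `merge_aliases` and the `aliases` attribute key for compounds are illustrative choices, not part of the schema. Attributes from later duplicates are dropped for brevity.

```python
def merge_aliases(entities: list[dict], canonical_names: dict[str, str]) -> list[dict]:
    """Collapse entities that share a canonical name.

    canonical_names maps a lower-cased alias to its canonical name.
    """
    merged: dict[str, dict] = {}
    for ent in entities:
        canonical = canonical_names.get(ent["name"].lower(), ent["name"])
        if canonical not in merged:
            # Copy so the caller's entity and its attributes are not mutated.
            merged[canonical] = {**ent, "name": canonical,
                                 "attributes": dict(ent.get("attributes", {}))}
        if ent["name"] != canonical:
            aliases = merged[canonical]["attributes"].setdefault("aliases", [])
            if ent["name"] not in aliases:
                aliases.append(ent["name"])
    return list(merged.values())


entities = [{"name": "Gleevec", "entity_type": "COMPOUND"},
            {"name": "imatinib", "entity_type": "COMPOUND"}]
print(merge_aliases(entities, {"gleevec": "imatinib"}))
```

Relations that referenced an alias would need the same renaming so that their source and target names still match an entity.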
For negated statements, do not extract affirmative relations. "Drug X did not show efficacy in disease Y" should NOT produce an INDICATED_FOR relation. You may note the negative finding in a lower-confidence ASSOCIATED_WITH with context making the negation clear, or omit it entirely.
Dosage and administration details are attributes of the COMPOUND entity, not separate entities. "Pembrolizumab 200 mg IV Q3W" produces one COMPOUND entity with dosage attributes.
Temporal information (dates, durations, timelines) should be captured as attributes, not entities. "12-week treatment period" is an attribute of the CLINICAL_TRIAL, not a separate entity.
Statistical values (p-values, hazard ratios, confidence intervals) should be captured as attributes of the relation or the relevant entity. Include them in the evidence quote and as structured attributes when possible.
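Structured statistics can be pulled from the evidence quote with simple patterns. This is a narrow sketch: the attribute names `hazard_ratio` and `p_value` are assumptions rather than schema fields, the regular expressions cover only common English phrasings, and the sample sentence is invented for illustration.

```python
import re

# Matches "HR 0.50", "HR = 0.50", "hazard ratio of 0.50", and similar.
HR_RE = re.compile(r"\b(?:HR|hazard ratio)\s*(?:=|of|,)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
# Matches "p<0.001", "P = 0.03", and similar.
P_RE = re.compile(r"\bp\s*([<=>])\s*(\d*\.\d+)", re.IGNORECASE)


def stats_from_evidence(evidence: str) -> dict:
    """Extract a hazard ratio and p-value from an evidence quote when present."""
    stats = {}
    hr = HR_RE.search(evidence)
    if hr:
        stats["hazard_ratio"] = float(hr.group(1))
    p = P_RE.search(evidence)
    if p:
        # Keep the comparator, since "p<0.001" is a bound rather than a value.
        stats["p_value"] = p.group(1) + p.group(2)
    return stats


# Invented sentence, not taken from any real trial report.
quote = "PFS was longer with Drug X (HR 0.50; 95% CI, 0.37 to 0.68; p<0.001)"
print(stats_from_evidence(quote))  # {'hazard_ratio': 0.5, 'p_value': '<0.001'}
```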
When in doubt about entity type, check the disambiguation rules above. When in doubt about whether to extract, err on the side of extraction with an appropriate confidence score.