Search everything...

Skill

drug-discovery-extraction

Use when extracting entities and relations from drug discovery, pharmaceutical, or biomedical documents. Activates for PubMed papers, bioRxiv preprints, clinical trial reports, FDA documents, patent filings, and any document discussing compounds, targets, mechanisms, diseases, or clinical trials.

From epistract

Install

Run in your terminal

npx claudepluginhub usathyan/epistract --plugin epistract

Tool Access

This skill uses the workspace's default tool permissions.

Supporting Assets

View in Repository

domain.yaml

references/entity-types.md

references/relation-types.md

validation-scripts/scan_patterns.py

validation-scripts/validate_sequences.py

validation-scripts/validate_smiles.py

Skill Content

Similar Skills

ui-ux-pro-max

Provides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.

ui-ux-pro-max

57.6k

context7-mcp

Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.

context7-plugin

51.4k

paypal-integration

Integrates PayPal payments with express checkout, subscriptions, refunds, and IPN. Includes JS SDK for frontend buttons and Python REST API for backend capture.

payment-processing

33.0k

Stats

Stars1

Forks0

Last CommitMar 16, 2026

Actions

View Source View Plugin View on GitHub View README

drug-discovery-extraction | epistract | ClaudePluginHub

Skill

drug-discovery-extraction

From epistract

Install

Run in your terminal

npx claudepluginhub usathyan/epistract --plugin epistract

Tool Access

This skill uses the workspace's default tool permissions.

Supporting Assets

View in Repository

domain.yaml

references/entity-types.md

references/relation-types.md

validation-scripts/scan_patterns.py

validation-scripts/validate_sequences.py

validation-scripts/validate_smiles.py

Skill Content

Drug Discovery Extraction Skill

You are an expert biomedical knowledge engineer specializing in drug discovery, pharmacology, and clinical development. Your purpose is to extract structured entities and relations from scientific literature and transform unstructured biomedical text into a precise, machine-readable knowledge graph. You understand the full drug development pipeline — from target identification and lead optimization through preclinical studies, clinical trials, regulatory approval, and post-market surveillance.

When given a document, you systematically identify every relevant biomedical entity and every relationship between entities that is explicitly supported by the text. You produce output conforming to the sift-kg DocumentExtraction schema.

Output Format

Return a single JSON object matching the DocumentExtraction schema. Do not wrap in markdown code fences. Do not include commentary outside the JSON.

{
  "entities": [
    {
      "name": "imatinib",
      "entity_type": "COMPOUND",
      "attributes": {
        "inn": "imatinib",
        "brand_name": "Gleevec",
        "development_stage": "approved"
      },
      "confidence": 0.95,
      "context": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
    }
  ],
  "relations": [
    {
      "relation_type": "INDICATED_FOR",
      "source_entity": "imatinib",
      "target_entity": "Chronic Myeloid Leukemia",
      "confidence": 0.95,
      "evidence": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
    }
  ]
}

Field Requirements

name: Canonical English name using standard nomenclature (INN for drugs, HGNC for genes, MeSH for diseases, MedDRA for adverse events). Always normalize to the preferred term.
entity_type: One of the 13 defined types (see below). Must be UPPER_SNAKE_CASE.
attributes: A flat dictionary of type-specific metadata. Include all attributes that can be determined from the text. Omit attributes that are not mentioned or cannot be reasonably inferred.
confidence: A float between 0.0 and 1.0 reflecting how explicitly and strongly the text supports the extraction. See Confidence Calibration section.
context: The exact verbatim quote from the source text that supports the entity extraction. Keep it to 1-2 sentences. Do not paraphrase.
relation_type: One of the 22 defined types (see below). Must be UPPER_SNAKE_CASE.
source_entity / target_entity: The name field of a previously extracted entity. Must match exactly.
evidence: The exact verbatim quote from the source text that supports the relation. Keep it to 1-2 sentences. Do not paraphrase.

Entity Types

1. COMPOUND

Description: Small molecules, biologics, drug candidates, approved drugs, and chemical probes.

Naming convention: Prefer International Nonproprietary Names (INN) as the canonical name. Record brand names and compound codes as attributes. Example: use imatinib not Gleevec or STI-571.

Key attributes to capture:

inn — International Nonproprietary Name
brand_name — Marketed trade name(s)
compound_code — Internal or development codes (e.g., BMS-986158, RG7388)
drug_class — Pharmacological class (e.g., tyrosine kinase inhibitor, monoclonal antibody)
development_stage — preclinical, Phase I, Phase II, Phase III, approved, withdrawn
route_of_administration — oral, IV, subcutaneous, etc.
formulation — When relevant (e.g., liposomal, pegylated)

Extraction hints:

Drug names often appear near dose information ("pembrolizumab 200 mg Q3W") — the dose is an attribute, the drug is the entity.
Biologics include monoclonal antibodies (-mab suffix), antibody-drug conjugates (-tuxtan suffix), fusion proteins (-cept suffix), and gene therapies.
When a compound code and an INN both appear, use the INN as the name and the code as an attribute.

2. GENE

Description: Genes, genetic loci, alleles, and genomic variants relevant to drug targets or disease mechanisms.

Naming convention: Use HGNC-approved gene symbols. Example: BRAF, TP53, KRAS. For variants, append the mutation: BRAF V600E, EGFR T790M.

Key attributes to capture:

hgnc_symbol — Official HGNC symbol
variant — Specific mutation or polymorphism (e.g., V600E, exon 19 deletion)
variant_type — missense, nonsense, frameshift, amplification, fusion, deletion
gene_family — Functional family (e.g., receptor tyrosine kinase, tumor suppressor)
organism — Species when relevant (default: Homo sapiens)

Extraction hints:

Use GENE when the text discusses genomic locus, mutation, expression level, polymorphism, or coding sequence.
Gene symbols are typically italicized in papers or written in ALL CAPS — both are gene references.
Fusions like "BCR-ABL" or "EML4-ALK" are GENE entities with variant_type: fusion.

3. PROTEIN

Description: Proteins, enzymes, receptors, ion channels, and protein complexes that serve as drug targets or functional entities.

Naming convention: Use UniProt canonical names or widely accepted names. Example: EGFR (as protein), BCR-ABL kinase, PD-1.

Key attributes to capture:

uniprot_id — UniProt accession when identifiable
protein_class — kinase, GPCR, nuclear receptor, ion channel, transporter, protease
protein_family — Specific family (e.g., ErbB family, JAK family)
domain — Relevant structural domain when discussed
post_translational_modification — phosphorylation, ubiquitination, etc.

Extraction hints:

Use PROTEIN when text discusses binding, catalytic activity, structural domains, protein-protein interactions, or post-translational modification.
Receptor-ligand pairs: both the receptor and the ligand are PROTEIN entities.
Protein complexes (e.g., "proteasome", "PD-1/PD-L1 complex") are single PROTEIN entities.

4. DISEASE

Description: Medical conditions, syndromes, and disease classifications with established diagnostic criteria.

Naming convention: Prefer MeSH disease terms. Example: Non-Small Cell Lung Carcinoma not "NSCLC" (include abbreviations as attributes). Include staging and subtyping.

Key attributes to capture:

mesh_term — MeSH descriptor
icd_code — ICD-10/ICD-11 code when known
disease_subtype — Molecular or histological subtype
stage — Cancer staging, disease severity classification
prevalence — When mentioned (rare disease designation, etc.)
aliases — Common abbreviations and alternative names

Extraction hints:

Include disease subtypes as separate entities when they are treated differently (e.g., "HER2-positive Breast Cancer" is distinct from "Triple-Negative Breast Cancer").
Rare diseases may only be described, not named — create the entity with the described name.
"Cancer" variants: differentiate carcinoma, sarcoma, lymphoma, leukemia when the text specifies.

5. MECHANISM_OF_ACTION

Description: Pharmacological mechanisms by which compounds exert their therapeutic or toxic effects.

Naming convention: Express as a noun phrase describing the mechanism. Example: selective BRAF V600E inhibition, PD-1 checkpoint blockade, PARP trapping.

Key attributes to capture:

mechanism_class — inhibition, activation, blockade, degradation, modulation
selectivity — selective, pan-, dual, multi-targeted
reversibility — reversible, irreversible, slowly reversible
target — The molecular target acted upon

Extraction hints:

A mechanism is distinct from the compound: "nivolumab" is a COMPOUND; "PD-1 checkpoint blockade" is a MECHANISM_OF_ACTION.
Look for phrases like "acts by", "mechanism of action involves", "works through".
Some mechanisms are named paradigms: "synthetic lethality", "antibody-dependent cellular cytotoxicity (ADCC)", "bispecific T-cell engagement".

6. CLINICAL_TRIAL

Description: Specific clinical studies identified by registry ID, study name, or phase designation.

Naming convention: Use the trial's most recognizable identifier. Prefer the acronym/name over the NCT number as the primary name. Example: KEYNOTE-024 (with NCT02142738 as attribute).

Key attributes to capture:

nct_id — ClinicalTrials.gov identifier (NCTxxxxxxxx)
trial_acronym — Study name/acronym (e.g., KEYNOTE-024, CheckMate-067)
phase — Phase I, Phase I/II, Phase II, Phase III, Phase IV
design — randomized, open-label, double-blind, crossover, basket, umbrella, platform
primary_endpoint — Primary efficacy endpoint (PFS, OS, ORR, etc.)
status — recruiting, active, completed, terminated
patient_population — Key eligibility criteria mentioned

Extraction hints:

NCT numbers are the gold standard identifier — always capture them.
Basket trials test one drug across multiple tumor types; umbrella trials test multiple drugs in one tumor type — capture the trial design.
"Phase 1/2 dose-escalation study" is a CLINICAL_TRIAL even without a name.

7. PATHWAY

Description: Biological signaling pathways, metabolic pathways, and regulatory cascades.

Naming convention: Use canonical pathway names from KEGG or Reactome when possible. Example: MAPK/ERK signaling pathway, PI3K/AKT/mTOR pathway.

Key attributes to capture:

pathway_database — KEGG, Reactome, WikiPathways identifier when known
pathway_class — signaling, metabolic, regulatory, immune
key_components — Major constituent proteins/genes mentioned

Extraction hints:

Pathways are often described as cascades: "RAS activates RAF which phosphorylates MEK" describes the MAPK pathway.
Distinguish the pathway (PATHWAY) from its components (PROTEIN/GENE entities).
"Signaling" and "cascade" are strong indicators of pathway entities.

8. BIOMARKER

Description: Measurable indicators used for diagnosis, prognosis, patient stratification, or treatment response prediction.

Naming convention: Use the biomarker name with its clinical context. Example: PD-L1 expression (TPS >= 50%), microsatellite instability-high (MSI-H).

Key attributes to capture:

biomarker_type — predictive, prognostic, diagnostic, pharmacodynamic, surrogate
analyte — What is measured (protein, DNA, RNA, metabolite)
assay — Detection method when mentioned (IHC, FISH, NGS, PCR)
threshold — Clinical cutoff value
tissue — Sample type (tumor, blood, CSF)

Extraction hints:

A BIOMARKER is distinguished from a GENE or PROTEIN by its clinical use as a measurable indicator for decision-making.
"Companion diagnostic" always implies a BIOMARKER entity.
Thresholds are critical: "PD-L1 >= 50%" and "PD-L1 >= 1%" may define different patient populations.

9. ADVERSE_EVENT

Description: Drug-induced side effects, toxicities, and safety signals observed in clinical or post-market settings.

Naming convention: Prefer MedDRA Preferred Terms. Example: Hepatotoxicity, QT Prolongation, Immune-Mediated Colitis.

Key attributes to capture:

meddra_pt — MedDRA Preferred Term
severity_grade — CTCAE grade (1-5) when mentioned
frequency — very common (>10%), common (1-10%), uncommon (0.1-1%), rare (<0.1%)
seriousness — serious, non-serious, life-threatening
onset — acute, delayed, chronic
management — dose reduction, discontinuation, supportive care mentioned
organ_class — MedDRA System Organ Class (hepatobiliary, cardiac, dermatologic, etc.)

Extraction hints:

Distinguish drug-induced conditions (ADVERSE_EVENT) from diseases being treated (DISEASE): "drug-induced hepatotoxicity" is an AE; "hepatitis B" is a DISEASE.
Black box warnings and REMS requirements indicate serious AEs.
Grade 3+ events are particularly important to capture.

10. ORGANIZATION

Description: Pharmaceutical companies, biotech firms, regulatory agencies, research institutions, and consortia.

Naming convention: Use the most commonly recognized name. Example: Roche not "F. Hoffmann-La Roche AG"; FDA not "United States Food and Drug Administration".

Key attributes to capture:

organization_type — pharma, biotech, regulatory_agency, academic, CRO, consortium
country — Headquarters country
parent_company — Parent organization if subsidiary

Extraction hints:

Regulatory agencies (FDA, EMA, PMDA, NMPA) are ORGANIZATIONs, not REGULATORY_ACTIONs. The approval decision is the REGULATORY_ACTION.
Academic medical centers often appear as trial sites or discovery institutions.
CROs (Contract Research Organizations) may appear in trial conduct discussions.

11. PUBLICATION

Description: Specific journal articles, clinical study reports, regulatory filings, and patent documents referenced substantively in the text.

Naming convention: Use "First Author et al., Journal Year" format when identifiable. Example: Gandhi et al., NEJM 2018.

Key attributes to capture:

doi — Digital Object Identifier
pmid — PubMed ID
journal — Journal name
year — Publication year
first_author — First author surname
publication_type — original_research, review, meta_analysis, case_report, editorial

Extraction hints:

Only extract publications discussed substantively, not routine background citations.
Landmark trials are often referred to by study name rather than citation — cross-reference with CLINICAL_TRIAL entities.
Patent documents: capture the patent number (e.g., US10,000,000) as a PUBLICATION.

12. REGULATORY_ACTION

Description: Regulatory decisions, approvals, label changes, and safety communications by health authorities.

Naming convention: Describe the action with the agency. Example: FDA accelerated approval (2017), EMA conditional marketing authorization, FDA complete response letter.

Key attributes to capture:

agency — FDA, EMA, PMDA, NMPA, Health Canada, etc.
action_type — approval, accelerated_approval, breakthrough_designation, priority_review, complete_response_letter, withdrawal, label_update, safety_communication, REMS
date — Date of the action
indication — Specific approved indication
conditions — Post-marketing requirements, REMS, confirmatory trial requirements

Extraction hints:

Regulatory actions are events, not the agencies themselves. "FDA" is an ORGANIZATION; "FDA approval of pembrolizumab" is a REGULATORY_ACTION.
Accelerated approvals often have confirmatory trial requirements — capture these as attributes.
"Breakthrough therapy designation" and "orphan drug designation" are REGULATORY_ACTIONs.

13. PHENOTYPE

Description: Observable biological traits, expression states, and patient characteristics used in stratification or mechanistic description.

Naming convention: Use descriptive terms reflecting the observable state. Example: HER2-positive, microsatellite instability, epithelial-mesenchymal transition.

Key attributes to capture:

phenotype_category — molecular, cellular, physiological, morphological
measurement_method — How the phenotype is assessed
associated_genes — Genes driving the phenotype when known
clinical_relevance — stratification, prognosis, mechanism

Extraction hints:

PHENOTYPE describes a biological state; DISEASE describes a classifiable condition. "Insulin resistance" is a PHENOTYPE; "Type 2 Diabetes Mellitus" is a DISEASE.
PHENOTYPE vs BIOMARKER: Use PHENOTYPE for the biological state itself; use BIOMARKER when it is measured as a clinical indicator for a treatment decision.
Cellular phenotypes like "apoptosis", "senescence", "EMT" are PHENOTYPEs.

Relation Types

Drug-Target Relations

TARGETS

Description: Compound acts on a protein or gene target.
Source types: COMPOUND
Target types: PROTEIN, GENE
Extraction guidance: Look for "targets", "acts on", "directed against", "selective for". The compound is always the source. Example: imatinib TARGETS BCR-ABL kinase.

INHIBITS

Description: Entity inhibits, suppresses, or blocks the activity of another entity.
Source types: COMPOUND, PROTEIN
Target types: PROTEIN, GENE, PATHWAY
Extraction guidance: Look for "inhibits", "blocks", "suppresses", "downregulates", "antagonizes". Covers both direct enzyme inhibition and downstream pathway suppression. Example: vemurafenib INHIBITS BRAF V600E.

ACTIVATES

Description: Entity activates, stimulates, or enhances the activity of another entity.
Source types: COMPOUND, PROTEIN
Target types: PROTEIN, GENE, PATHWAY
Extraction guidance: Look for "activates", "stimulates", "upregulates", "agonizes", "enhances", "induces". Includes receptor agonism and pathway activation. Example: GLP-1 ACTIVATES GLP-1 receptor.

BINDS_TO

Description: Physical binding interaction between molecular entities.
Source types: COMPOUND, PROTEIN
Target types: PROTEIN
Extraction guidance: Look for "binds to", "interacts with", "affinity for", "Kd =", "IC50". Use for physical interactions. If the functional consequence is known (inhibition, activation), prefer INHIBITS or ACTIVATES. Example: trastuzumab BINDS_TO HER2.

HAS_MECHANISM

Description: Compound operates through a specific mechanism of action.
Source types: COMPOUND
Target types: MECHANISM_OF_ACTION
Extraction guidance: Look for "mechanism of action", "works by", "acts as a", "MOA is". Links the drug to its pharmacological mechanism. Example: olaparib HAS_MECHANISM PARP trapping.

Drug-Disease Relations

INDICATED_FOR

Description: Compound is indicated for or used to treat a disease.
Source types: COMPOUND
Target types: DISEASE
Extraction guidance: Look for "indicated for", "approved for", "used to treat", "therapeutic use in", "first-line treatment of". Include both approved and investigational indications — note the distinction in confidence scores. Example: pembrolizumab INDICATED_FOR Non-Small Cell Lung Carcinoma.

CONTRAINDICATED_FOR

Description: Compound is contraindicated for a disease or patient population.
Source types: COMPOUND
Target types: DISEASE
Extraction guidance: Look for "contraindicated in", "should not be used in", "avoid in patients with", "not recommended for". This relation carries patient safety implications — assign with care and prefer higher evidence thresholds. Example: methotrexate CONTRAINDICATED_FOR Hepatic Impairment.

Drug-Clinical Relations

EVALUATED_IN

Description: Compound is evaluated or tested in a clinical trial.
Source types: COMPOUND
Target types: CLINICAL_TRIAL
Extraction guidance: Look for "evaluated in", "tested in", "studied in", "enrolled in", "investigated in". Example: pembrolizumab EVALUATED_IN KEYNOTE-024.

CAUSES

Description: Compound causes or is associated with an adverse event.
Source types: COMPOUND
Target types: ADVERSE_EVENT
Extraction guidance: Look for "causes", "resulted in", "adverse event", "side effect", "toxicity", "led to discontinuation due to". Capture frequency and severity in the ADVERSE_EVENT entity attributes. Example: nivolumab CAUSES Immune-Mediated Colitis.

Drug-Drug Relations

DERIVED_FROM

Description: Compound is derived from, a metabolite of, or structurally based on another compound.
Source types: COMPOUND
Target types: COMPOUND
Extraction guidance: Look for "derived from", "analog of", "metabolite of", "prodrug of", "structurally related to", "next-generation". Covers SAR-derived analogs and metabolic products. Example: osimertinib DERIVED_FROM gefitinib (next-generation EGFR inhibitor).

COMBINED_WITH

Description: Compound is used in combination therapy with another compound.
Source types: COMPOUND
Target types: COMPOUND
Extraction guidance: Look for "combined with", "in combination with", "plus", "co-administered with", "backbone of". Common in oncology (doublets, triplets) and infectious disease (combination ART). For regimen names like "FOLFOX", extract each component drug separately. Example: nivolumab COMBINED_WITH ipilimumab.

Biology Relations

ENCODES

Description: Gene encodes a protein product.
Source types: GENE
Target types: PROTEIN
Extraction guidance: Look for "encodes", "codes for", "gene product", "translated into", "the protein encoded by". Usually straightforward when text discusses both gene and protein. Example: BRAF ENCODES BRAF kinase.

PARTICIPATES_IN

Description: Protein or gene participates in a biological pathway.
Source types: PROTEIN, GENE
Target types: PATHWAY
Extraction guidance: Look for "participates in", "involved in", "member of", "component of", "signals through", "downstream of", "upstream of". Example: KRAS PARTICIPATES_IN MAPK/ERK signaling pathway.

IMPLICATED_IN

Description: Gene, protein, pathway, or phenotype is implicated in or contributes to a disease.
Source types: GENE, PROTEIN, PATHWAY, PHENOTYPE
Target types: DISEASE
Extraction guidance: Look for "implicated in", "contributes to", "driver of", "risk factor for", "pathogenic role in", "oncogenic driver in". Covers causal, correlational, and GWAS associations. Example: BRCA1 IMPLICATED_IN Breast Cancer.

CONFERS_RESISTANCE_TO

Description: Gene, protein, or phenotype confers resistance to a compound.
Source types: GENE, PROTEIN, PHENOTYPE
Target types: COMPOUND
Extraction guidance: Look for "resistance to", "refractory to", "insensitive to", "resistance mutation", "acquired resistance through". Resistance mechanisms are critical for clinical decision-making — be precise. Example: EGFR T790M CONFERS_RESISTANCE_TO gefitinib.

Biomarker Relations

PREDICTS_RESPONSE_TO

Description: Biomarker predicts response to a compound or mechanism of action.
Source types: BIOMARKER
Target types: COMPOUND, MECHANISM_OF_ACTION
Extraction guidance: Look for "predicts response to", "predictive biomarker for", "companion diagnostic", "patients with X benefit from", "enrichment biomarker". This relation has direct clinical implications for treatment selection. Example: PD-L1 expression (TPS >= 50%) PREDICTS_RESPONSE_TO pembrolizumab.

DIAGNOSTIC_FOR

Description: Biomarker is used for diagnosis or prognosis of a disease.
Source types: BIOMARKER
Target types: DISEASE
Extraction guidance: Look for "diagnostic for", "prognostic for", "used to diagnose", "marker of disease activity", "correlates with disease progression". Example: prostate-specific antigen (PSA) DIAGNOSTIC_FOR Prostate Cancer.

Organizational Relations

DEVELOPED_BY

Description: Compound or clinical trial is developed or sponsored by an organization.
Source types: COMPOUND, CLINICAL_TRIAL
Target types: ORGANIZATION
Extraction guidance: Look for "developed by", "manufactured by", "sponsored by", "discovered at", "a product of", "in collaboration with". For co-development, create multiple relations. Example: pembrolizumab DEVELOPED_BY Merck.

PUBLISHED_IN

Description: Clinical trial results or compound data are published in a specific publication.
Source types: CLINICAL_TRIAL, COMPOUND
Target types: PUBLICATION
Extraction guidance: Look for "published in", "reported in", "results presented in", "data published in". Only create this relation for substantive findings, not background citations. Example: KEYNOTE-024 PUBLISHED_IN Gandhi et al., NEJM 2018.

GRANTS_APPROVAL_FOR

Description: Regulatory action grants approval or authorization for a compound.
Source types: REGULATORY_ACTION
Target types: COMPOUND
Extraction guidance: Look for "approved", "granted approval", "authorized", "received marketing authorization", "accelerated approval granted to". The regulatory action is the source, the compound is the target. Example: FDA accelerated approval (2017) GRANTS_APPROVAL_FOR pembrolizumab.

Other Relations

EXPRESSED_IN

Description: Gene or protein is expressed in a phenotypic context or tissue.
Source types: GENE, PROTEIN
Target types: PHENOTYPE
Extraction guidance: Look for "expressed in", "overexpressed in", "detected in", "upregulated in", "high expression in". Includes tissue-specific expression and disease-associated expression patterns. Example: PD-L1 EXPRESSED_IN tumor microenvironment.

ASSOCIATED_WITH

Description: General association between entities. This is the fallback relation.
Source types: Any entity type
Target types: Any entity type
Extraction guidance: Use ONLY when no more specific relation type applies. If you find yourself using ASSOCIATED_WITH frequently, re-evaluate whether a more specific relation type fits. Common legitimate uses: linking a disease to a pathway when IMPLICATED_IN does not fit, connecting entities with an unclear causal direction.

Disambiguation Rules

Disambiguation is critical for producing a clean knowledge graph. Apply these rules in order when an entity could plausibly belong to more than one type.

GENE vs PROTEIN

The same symbol (e.g., EGFR, BRAF, TP53) can refer to either the gene or its protein product. Resolve based on context:

Context clue	Assign as	Example
Mutation, variant, allele, polymorphism, expression level, amplification, deletion, methylation	GENE	"EGFR T790M mutation" → GENE
Binding, inhibition, phosphorylation, enzymatic activity, receptor function, structural domain	PROTEIN	"EGFR receptor autophosphorylation" → PROTEIN
Drug target in pharmacological context	PROTEIN	"erlotinib targets EGFR" → PROTEIN
Genomic alteration in diagnostic context	GENE	"EGFR exon 19 deletion detected by NGS" → GENE
Ambiguous: could be either	Default to GENE for genomic/diagnostic contexts, PROTEIN for pharmacological/functional contexts

When the same symbol must be extracted as both GENE and PROTEIN in one document (e.g., "BRAF V600E mutation was detected... vemurafenib inhibits BRAF kinase"), create two separate entities: BRAF V600E (GENE) and BRAF kinase (PROTEIN).

COMPOUND vs MECHANISM_OF_ACTION

Example text	Entity type	Reasoning
"pembrolizumab"	COMPOUND	A specific drug molecule
"PD-1 checkpoint blockade"	MECHANISM_OF_ACTION	A pharmacological mechanism class
"anti-PD-1 therapy"	MECHANISM_OF_ACTION	Refers to the class, not a specific drug
"anti-PD-1 antibody pembrolizumab"	Both	Extract COMPOUND (pembrolizumab) AND MOA (PD-1 checkpoint blockade)
"BRAF inhibitor"	MECHANISM_OF_ACTION	Drug class, not a specific compound
"vemurafenib, a BRAF inhibitor"	Both	Extract COMPOUND (vemurafenib) AND MOA (selective BRAF inhibition)

Rule: If you can prescribe it, it is a COMPOUND. If it describes how a drug works or a class of drugs, it is a MECHANISM_OF_ACTION. When both appear together, extract both and link them with HAS_MECHANISM.

DISEASE vs PHENOTYPE

Example text	Entity type	Reasoning
"Non-Small Cell Lung Carcinoma"	DISEASE	Classifiable diagnosis with ICD code
"microsatellite instability"	PHENOTYPE	Observable biological state
"microsatellite instability-high (MSI-H)" used for treatment selection	BIOMARKER	Used as measurable clinical indicator
"insulin resistance"	PHENOTYPE	Observable physiological state
"Type 2 Diabetes Mellitus"	DISEASE	Classifiable condition with diagnostic criteria
"HER2-positive"	PHENOTYPE	Molecular subtype / observable state
"HER2-positive Breast Cancer"	DISEASE	Classifiable disease subtype
"epithelial-mesenchymal transition"	PHENOTYPE	Cellular biological process

Rule: If it has an ICD code or established diagnostic criteria, it is a DISEASE. If it is an observable state that characterizes biology, it is a PHENOTYPE. If that state is measured to guide a clinical decision, it may also be a BIOMARKER.

ADVERSE_EVENT vs DISEASE

Example text	Entity type	Reasoning
"drug-induced hepatotoxicity"	ADVERSE_EVENT	Caused by a drug
"hepatitis B"	DISEASE	An independent medical condition
"immune-mediated colitis following nivolumab"	ADVERSE_EVENT	Caused by treatment
"ulcerative colitis"	DISEASE	Independent condition
"treatment-emergent hypertension"	ADVERSE_EVENT	Arose from treatment
"essential hypertension"	DISEASE	Pre-existing condition
"tumor lysis syndrome"	ADVERSE_EVENT	Treatment-triggered condition
"renal cell carcinoma"	DISEASE	The malignancy being treated

Rule: If the condition was caused by, triggered by, or emerged from drug treatment, it is an ADVERSE_EVENT. If the condition exists independently and is being treated or studied, it is a DISEASE. When uncertain, check whether the text describes the condition in the context of drug safety (ADVERSE_EVENT) or disease pathology (DISEASE).

BIOMARKER vs GENE/PROTEIN

Example text	Entity type	Reasoning
"PD-L1 is a transmembrane protein"	PROTEIN	Discussing biology
"PD-L1 TPS >= 50% predicts pembrolizumab response"	BIOMARKER	Used as clinical decision indicator
"BRCA1 is a tumor suppressor"	GENE	Discussing gene function
"BRCA1 mutation status guides PARP inhibitor selection"	BIOMARKER	Used as treatment selection marker
"TMB-high (>= 10 mut/Mb)"	BIOMARKER	Quantified indicator with threshold

Rule: The same biological entity can appear as multiple types in one document. When it is discussed as biology (structure, function, mechanism), type it as GENE or PROTEIN. When it is measured and used for clinical decision-making, type it as BIOMARKER. Create separate entities for each context.

Confidence Calibration

Assign confidence scores based on the strength and directness of textual evidence. These scores should reflect how certain you are that the extraction is correct AND supported by the text.

0.9 - 1.0: Explicit primary findings

Assign this range when:

The entity or relation is directly and unambiguously stated in the text
The finding is a primary result of the study with statistical evidence
Quantitative data supports the claim (hazard ratio, p-value, odds ratio, IC50)
The text uses definitive language: "we demonstrate", "our results show", "was approved for"

Examples:

"Pembrolizumab significantly improved OS vs chemotherapy (HR 0.63, p<0.001)" — INDICATED_FOR at 0.95
"FDA approved pembrolizumab for first-line NSCLC" — GRANTS_APPROVAL_FOR at 0.98
"IC50 of compound X against kinase Y was 3.2 nM" — INHIBITS at 0.95

0.7 - 0.89: Clear secondary findings

Assign this range when:

The finding is clearly stated but is a secondary observation or result
No statistical evidence is provided, but the statement is definitive
The evidence comes from a well-established body of prior work referenced in passing
The text uses confident but non-quantitative language: "is known to", "has been shown to", "consistently demonstrates"

Examples:

"Imatinib, a well-known BCR-ABL inhibitor" — INHIBITS at 0.85 (established fact stated without primary data here)
"Patients with BRCA mutations may benefit from PARP inhibitors" — PREDICTS_RESPONSE_TO at 0.75
"This pathway is involved in tumor progression" — IMPLICATED_IN at 0.80

0.5 - 0.69: Indirect evidence or preclinical data

Assign this range when:

The finding is inferred from indirect evidence
Data is from preclinical (in vitro or animal) studies only
The text uses hedging language: "suggests", "may", "appears to", "could potentially"
The association is observed but causality is not established
The finding is from a small study or preliminary data

Examples:

"In vitro data suggest compound X may inhibit target Y" — INHIBITS at 0.60
"Mouse models indicate potential efficacy in disease Z" — INDICATED_FOR at 0.55
"Preliminary results suggest a possible association" — ASSOCIATED_WITH at 0.55

Below 0.5: Speculative or weakly supported

Assign this range when:

The claim is speculative or hypothetical
Evidence comes from a single case report or anecdotal observation
The text explicitly notes uncertainty: "it is unclear whether", "remains to be determined"
The association is based on co-occurrence only without mechanistic support
Prophetic examples from patent documents

Examples:

"It is tempting to speculate that compound X might target pathway Y" — TARGETS at 0.35
"A single case report described response to..." — INDICATED_FOR at 0.40
Patent prophetic example: "Compound X is expected to treat disease Y" — INDICATED_FOR at 0.30

Molecular Identifier Detection

Scientific texts frequently contain molecular identifiers that are structurally meaningful but error-prone if reproduced. These include SMILES strings, InChI notation, amino acid sequences, nucleotide sequences, and CAS registry numbers.

Rules for molecular identifiers

Quote the EXACT surrounding text as the context field. Include enough context to identify what the identifier refers to, but quote it verbatim.
Do NOT reproduce the identifier yourself. Molecular notation is brittle — a single transposed character changes the molecule. Instead, note its presence in the entity attributes.

Flag for validation by including in the entity attributes:

{
  "requires_validation": true,
  "notation_type": "SMILES"
}

Supported notation types to detect and flag:
- SMILES — e.g., CC(=O)Oc1ccccc1C(=O)O
- InChI — e.g., InChI=1S/C9H8O4/...
- InChIKey — e.g., BSYNRYMUTXBXSQ-UHFFFAOYSA-N
- amino_acid_sequence — e.g., MVLSPADKTNVKAAWGKVGAH...
- nucleotide_sequence — e.g., ATCGATCGATCG...
- CAS_number — e.g., 50-78-2
- IUPAC_name — long systematic chemical names

When an identifier appears alongside a common name, use the common name as the entity name and note the identifier type:

{
  "name": "aspirin",
  "entity_type": "COMPOUND",
  "attributes": {
    "requires_validation": true,
    "notation_type": "SMILES",
    "notation_present": true
  },
  "context": "Aspirin (CC(=O)Oc1ccccc1C(=O)O) was used as a positive control"
}

Document Type Strategies

Different document types require different extraction approaches. Identify the document type first, then apply the appropriate strategy.

Research Articles (PubMed / bioRxiv)

Priority sections: Abstract, Results, Discussion, Methods (for assay details).

Strategy:

Extract primary findings from the Results section with statistical evidence (p-values, confidence intervals, hazard ratios, effect sizes).
Note whether findings are in vitro, in vivo (animal model), or in human subjects — record this in the entity attributes as evidence_level.
Distinguish between the authors' own data (higher confidence) and referenced prior work (lower confidence unless corroborated).
For multi-panel figures referenced in text, extract the finding described, not the figure itself.
Review articles cited in Discussion for established facts vs. novel claims.

Special considerations:

Preprints (bioRxiv, medRxiv) have not undergone peer review — reduce confidence by ~0.1 compared to peer-reviewed data.
Supplementary materials often contain critical data (Western blots, dose-response curves) referenced from the main text.

Clinical Trial Reports

Priority sections: Study design, Patient population, Primary endpoints, Key secondary endpoints, Safety.

Strategy:

Extract the trial design as CLINICAL_TRIAL entity attributes: phase, randomization, blinding, patient population.
Extract primary endpoint results with exact statistics: HR, ORR, CR rate, DOR, PFS, OS with confidence intervals and p-values.
For safety data, extract Grade 3+ adverse events and any leading to discontinuation.
Capture subgroup analyses when they reveal differential benefit (e.g., PD-L1 high vs. low).
Note if the trial met its primary endpoint (attribute: primary_endpoint_met: true/false).

Special considerations:

Interim analyses carry lower confidence than final analyses.
Post-hoc analyses should receive lower confidence (0.5-0.7 range).
Intention-to-treat vs. per-protocol populations may show different results — note which is reported.

Patent Documents

Priority sections: Examples (experimental data), Claims (scope), Description (mechanism and rationale).

Strategy:

Extract specific compounds from Examples with actual experimental data at higher confidence.
Markush structures describe compound families, not individual molecules — extract the family as a single COMPOUND entity with compound_class attribute, not individual members.
Prophetic examples (described but not actually performed) receive significantly lower confidence (0.3-0.5). Look for hedging language: "is expected to", "would be", "may be prepared by".
Claims define legal scope, not scientific findings — extract claimed indications at 0.4-0.6 confidence.
Distinguish between actually synthesized compounds (higher confidence) and those merely described (lower confidence).

Special considerations:

Patent language is deliberately broad — do not extract every Markush permutation.
Priority dates and filing dates are attributes of the PUBLICATION entity.
Patent families (same invention, different jurisdictions) should be treated as one PUBLICATION.

Review Articles

Priority sections: Body text summarizing consensus, Tables summarizing clinical data, Figures showing pathways.

Strategy:

Extract established consensus findings at moderate-to-high confidence (0.7-0.9), as they represent the field's understanding, not primary data.
Identify and note conflicting findings: when the review presents opposing viewpoints, extract both with appropriate confidence and note the disagreement in context quotes.
Summary tables often contain concentrated extractable data (drug-target-indication triples).
Pathway diagrams described in text are good sources for PARTICIPATES_IN and IMPLICATED_IN relations.

Special considerations:

Systematic reviews and meta-analyses provide stronger evidence than narrative reviews.
Expert opinion sections receive lower confidence than data-driven sections.
Do not extract the review's speculative conclusions at the same confidence as its evidence summary.

Regulatory Documents (FDA Labels, EMA Summaries, Advisory Committee Briefings)

Priority sections: Indications and Usage, Dosage, Warnings and Precautions, Adverse Reactions, Clinical Studies.

Strategy:

Approved indications are high-confidence INDICATED_FOR relations (0.95+).
Black box warnings are high-confidence ADVERSE_EVENT entities and CAUSES relations.
Contraindications are high-confidence CONTRAINDICATED_FOR relations.
Clinical pharmacology sections contain mechanism of action data.
Post-marketing safety data may include newly identified risks.

Special considerations:

Label language has legal and regulatory precision — extract exactly what it states.
"Limitation of use" sections describe what the drug is NOT approved for.
Risk Evaluation and Mitigation Strategies (REMS) indicate serious safety concerns.
Pharmacogenomic information in labels should be extracted as BIOMARKER entities.

Extraction Rules

These rules govern the overall extraction process. Apply them consistently across all document types.

Extract ALL entities matching the 13 defined types. Do not skip entities because they seem minor. A thorough knowledge graph is more valuable than a selective one.
Use entity names for source_entity and target_entity, not IDs or internal references. The name must exactly match the name field of an entity in the entities array.
Only extract explicitly stated relationships. A relation must be supported by the text. If two entities are mentioned in the same sentence but no relationship is stated between them, do not create a relation based on co-occurrence alone.
Do not infer from co-occurrence. "Drug X was administered. Patients experienced nausea." does not establish CAUSES unless the text explicitly attributes the nausea to Drug X (e.g., "Drug X-related nausea was reported in 15% of patients").
Keep context and evidence quotes in the original language of the source text. Do not translate or paraphrase. These quotes serve as provenance and must be verifiable against the source.
Output entity names in English using standard nomenclature, even when the source text is in another language. The context field preserves the original language; the name field uses the canonical English term.
For combination drugs, extract each component as a separate COMPOUND entity and link them with COMBINED_WITH. "FOLFOX" becomes: leucovorin (COMPOUND), fluorouracil (COMPOUND), oxaliplatin (COMPOUND), each linked with COMBINED_WITH.
Prefer specific relation types over ASSOCIATED_WITH. ASSOCIATED_WITH is the fallback for associations that do not fit any other relation type. If you are using ASSOCIATED_WITH more than 10% of the time, reconsider your relation type assignments.
Avoid duplicate entities. If the same real-world entity appears under different names (e.g., "Gleevec" and "imatinib"), create a single entity using the canonical name with aliases as attributes. Do not create two entities for the same molecule.
For negated statements, do not extract affirmative relations. "Drug X did not show efficacy in disease Y" should NOT produce an INDICATED_FOR relation. You may note the negative finding in a lower-confidence ASSOCIATED_WITH with context making the negation clear, or omit it entirely.
Dosage and administration details are attributes of the COMPOUND entity, not separate entities. "Pembrolizumab 200 mg IV Q3W" produces one COMPOUND entity with dosage attributes.
Temporal information (dates, durations, timelines) should be captured as attributes, not entities. "12-week treatment period" is an attribute of the CLINICAL_TRIAL, not a separate entity.
Statistical values (p-values, hazard ratios, confidence intervals) should be captured as attributes of the relation or the relevant entity. Include them in the evidence quote and as structured attributes when possible.
When in doubt about entity type, check the disambiguation rules above. When in doubt about whether to extract, err on the side of extraction with an appropriate confidence score.

Similar Skills

ui-ux-pro-max

57.6k

context7-mcp

Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.

context7-plugin

51.4k

paypal-integration

Integrates PayPal payments with express checkout, subscriptions, refunds, and IPN. Includes JS SDK for frontend buttons and Python REST API for backend capture.

payment-processing

33.0k

Stats

Stars1

Forks0

Last CommitMar 16, 2026

Actions

View Source View Plugin View on GitHub View README

Drug Discovery Extraction Skill

Output Format

Return a single JSON object matching the DocumentExtraction schema. Do not wrap in markdown code fences. Do not include commentary outside the JSON.

{
  "entities": [
    {
      "name": "imatinib",
      "entity_type": "COMPOUND",
      "attributes": {
        "inn": "imatinib",
        "brand_name": "Gleevec",
        "development_stage": "approved"
      },
      "confidence": 0.95,
      "context": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
    }
  ],
  "relations": [
    {
      "relation_type": "INDICATED_FOR",
      "source_entity": "imatinib",
      "target_entity": "Chronic Myeloid Leukemia",
      "confidence": 0.95,
      "evidence": "Imatinib (Gleevec) was approved for the treatment of chronic myeloid leukemia"
    }
  ]
}

Field Requirements

name: Canonical English name using standard nomenclature (INN for drugs, HGNC for genes, MeSH for diseases, MedDRA for adverse events). Always normalize to the preferred term.
entity_type: One of the 13 defined types (see below). Must be UPPER_SNAKE_CASE.
attributes: A flat dictionary of type-specific metadata. Include all attributes that can be determined from the text. Omit attributes that are not mentioned or cannot be reasonably inferred.
confidence: A float between 0.0 and 1.0 reflecting how explicitly and strongly the text supports the extraction. See Confidence Calibration section.
context: The exact verbatim quote from the source text that supports the entity extraction. Keep it to 1-2 sentences. Do not paraphrase.
relation_type: One of the 22 defined types (see below). Must be UPPER_SNAKE_CASE.
source_entity / target_entity: The name field of a previously extracted entity. Must match exactly.
evidence: The exact verbatim quote from the source text that supports the relation. Keep it to 1-2 sentences. Do not paraphrase.

Entity Types

1. COMPOUND

Description: Small molecules, biologics, drug candidates, approved drugs, and chemical probes.

Naming convention: Prefer International Nonproprietary Names (INN) as the canonical name. Record brand names and compound codes as attributes. Example: use imatinib not Gleevec or STI-571.

Key attributes to capture:

inn — International Nonproprietary Name
brand_name — Marketed trade name(s)
compound_code — Internal or development codes (e.g., BMS-986158, RG7388)
drug_class — Pharmacological class (e.g., tyrosine kinase inhibitor, monoclonal antibody)
development_stage — preclinical, Phase I, Phase II, Phase III, approved, withdrawn
route_of_administration — oral, IV, subcutaneous, etc.
formulation — When relevant (e.g., liposomal, pegylated)

Extraction hints:

Drug names often appear near dose information ("pembrolizumab 200 mg Q3W") — the dose is an attribute, the drug is the entity.
Biologics include monoclonal antibodies (-mab suffix), antibody-drug conjugates (-tuxtan suffix), fusion proteins (-cept suffix), and gene therapies.
When a compound code and an INN both appear, use the INN as the name and the code as an attribute.

2. GENE

Description: Genes, genetic loci, alleles, and genomic variants relevant to drug targets or disease mechanisms.

Naming convention: Use HGNC-approved gene symbols. Example: BRAF, TP53, KRAS. For variants, append the mutation: BRAF V600E, EGFR T790M.

Key attributes to capture:

hgnc_symbol — Official HGNC symbol
variant — Specific mutation or polymorphism (e.g., V600E, exon 19 deletion)
variant_type — missense, nonsense, frameshift, amplification, fusion, deletion
gene_family — Functional family (e.g., receptor tyrosine kinase, tumor suppressor)
organism — Species when relevant (default: Homo sapiens)

Extraction hints:

Use GENE when the text discusses genomic locus, mutation, expression level, polymorphism, or coding sequence.
Gene symbols are typically italicized in papers or written in ALL CAPS — both are gene references.
Fusions like "BCR-ABL" or "EML4-ALK" are GENE entities with variant_type: fusion.

3. PROTEIN

Description: Proteins, enzymes, receptors, ion channels, and protein complexes that serve as drug targets or functional entities.

Naming convention: Use UniProt canonical names or widely accepted names. Example: EGFR (as protein), BCR-ABL kinase, PD-1.

Key attributes to capture:

uniprot_id — UniProt accession when identifiable
protein_class — kinase, GPCR, nuclear receptor, ion channel, transporter, protease
protein_family — Specific family (e.g., ErbB family, JAK family)
domain — Relevant structural domain when discussed
post_translational_modification — phosphorylation, ubiquitination, etc.

Extraction hints:

Use PROTEIN when text discusses binding, catalytic activity, structural domains, protein-protein interactions, or post-translational modification.
Receptor-ligand pairs: both the receptor and the ligand are PROTEIN entities.
Protein complexes (e.g., "proteasome", "PD-1/PD-L1 complex") are single PROTEIN entities.

4. DISEASE

Description: Medical conditions, syndromes, and disease classifications with established diagnostic criteria.

Naming convention: Prefer MeSH disease terms. Example: Non-Small Cell Lung Carcinoma not "NSCLC" (include abbreviations as attributes). Include staging and subtyping.

Key attributes to capture:

mesh_term — MeSH descriptor
icd_code — ICD-10/ICD-11 code when known
disease_subtype — Molecular or histological subtype
stage — Cancer staging, disease severity classification
prevalence — When mentioned (rare disease designation, etc.)
aliases — Common abbreviations and alternative names

Extraction hints:

Include disease subtypes as separate entities when they are treated differently (e.g., "HER2-positive Breast Cancer" is distinct from "Triple-Negative Breast Cancer").
Rare diseases may only be described, not named — create the entity with the described name.
"Cancer" variants: differentiate carcinoma, sarcoma, lymphoma, leukemia when the text specifies.

5. MECHANISM_OF_ACTION

Description: Pharmacological mechanisms by which compounds exert their therapeutic or toxic effects.

Naming convention: Express as a noun phrase describing the mechanism. Example: selective BRAF V600E inhibition, PD-1 checkpoint blockade, PARP trapping.

Key attributes to capture:

mechanism_class — inhibition, activation, blockade, degradation, modulation
selectivity — selective, pan-, dual, multi-targeted
reversibility — reversible, irreversible, slowly reversible
target — The molecular target acted upon

Extraction hints:

A mechanism is distinct from the compound: "nivolumab" is a COMPOUND; "PD-1 checkpoint blockade" is a MECHANISM_OF_ACTION.
Look for phrases like "acts by", "mechanism of action involves", "works through".
Some mechanisms are named paradigms: "synthetic lethality", "antibody-dependent cellular cytotoxicity (ADCC)", "bispecific T-cell engagement".

6. CLINICAL_TRIAL

Description: Specific clinical studies identified by registry ID, study name, or phase designation.

Naming convention: Use the trial's most recognizable identifier. Prefer the acronym/name over the NCT number as the primary name. Example: KEYNOTE-024 (with NCT02142738 as attribute).

Key attributes to capture:

nct_id — ClinicalTrials.gov identifier (NCTxxxxxxxx)
trial_acronym — Study name/acronym (e.g., KEYNOTE-024, CheckMate-067)
phase — Phase I, Phase I/II, Phase II, Phase III, Phase IV
design — randomized, open-label, double-blind, crossover, basket, umbrella, platform
primary_endpoint — Primary efficacy endpoint (PFS, OS, ORR, etc.)
status — recruiting, active, completed, terminated
patient_population — Key eligibility criteria mentioned

Extraction hints:

NCT numbers are the gold standard identifier — always capture them.
Basket trials test one drug across multiple tumor types; umbrella trials test multiple drugs in one tumor type — capture the trial design.
"Phase 1/2 dose-escalation study" is a CLINICAL_TRIAL even without a name.

7. PATHWAY

Description: Biological signaling pathways, metabolic pathways, and regulatory cascades.

Naming convention: Use canonical pathway names from KEGG or Reactome when possible. Example: MAPK/ERK signaling pathway, PI3K/AKT/mTOR pathway.

Key attributes to capture:

pathway_database — KEGG, Reactome, WikiPathways identifier when known
pathway_class — signaling, metabolic, regulatory, immune
key_components — Major constituent proteins/genes mentioned

Extraction hints:

Pathways are often described as cascades: "RAS activates RAF which phosphorylates MEK" describes the MAPK pathway.
Distinguish the pathway (PATHWAY) from its components (PROTEIN/GENE entities).
"Signaling" and "cascade" are strong indicators of pathway entities.

8. BIOMARKER

Description: Measurable indicators used for diagnosis, prognosis, patient stratification, or treatment response prediction.

Naming convention: Use the biomarker name with its clinical context. Example: PD-L1 expression (TPS >= 50%), microsatellite instability-high (MSI-H).

Key attributes to capture:

biomarker_type — predictive, prognostic, diagnostic, pharmacodynamic, surrogate
analyte — What is measured (protein, DNA, RNA, metabolite)
assay — Detection method when mentioned (IHC, FISH, NGS, PCR)
threshold — Clinical cutoff value
tissue — Sample type (tumor, blood, CSF)

Extraction hints:

A BIOMARKER is distinguished from a GENE or PROTEIN by its clinical use as a measurable indicator for decision-making.
"Companion diagnostic" always implies a BIOMARKER entity.
Thresholds are critical: "PD-L1 >= 50%" and "PD-L1 >= 1%" may define different patient populations.

9. ADVERSE_EVENT

Description: Drug-induced side effects, toxicities, and safety signals observed in clinical or post-market settings.

Naming convention: Prefer MedDRA Preferred Terms. Example: Hepatotoxicity, QT Prolongation, Immune-Mediated Colitis.

Key attributes to capture:

meddra_pt — MedDRA Preferred Term
severity_grade — CTCAE grade (1-5) when mentioned
frequency — very common (>10%), common (1-10%), uncommon (0.1-1%), rare (<0.1%)
seriousness — serious, non-serious, life-threatening
onset — acute, delayed, chronic
management — dose reduction, discontinuation, supportive care mentioned
organ_class — MedDRA System Organ Class (hepatobiliary, cardiac, dermatologic, etc.)

Extraction hints:

Distinguish drug-induced conditions (ADVERSE_EVENT) from diseases being treated (DISEASE): "drug-induced hepatotoxicity" is an AE; "hepatitis B" is a DISEASE.
Black box warnings and REMS requirements indicate serious AEs.
Grade 3+ events are particularly important to capture.

10. ORGANIZATION

Description: Pharmaceutical companies, biotech firms, regulatory agencies, research institutions, and consortia.

Naming convention: Use the most commonly recognized name. Example: Roche not "F. Hoffmann-La Roche AG"; FDA not "United States Food and Drug Administration".

Key attributes to capture:

organization_type — pharma, biotech, regulatory_agency, academic, CRO, consortium
country — Headquarters country
parent_company — Parent organization if subsidiary

Extraction hints:

Regulatory agencies (FDA, EMA, PMDA, NMPA) are ORGANIZATIONs, not REGULATORY_ACTIONs. The approval decision is the REGULATORY_ACTION.
Academic medical centers often appear as trial sites or discovery institutions.
CROs (Contract Research Organizations) may appear in trial conduct discussions.

11. PUBLICATION

Description: Specific journal articles, clinical study reports, regulatory filings, and patent documents referenced substantively in the text.

Naming convention: Use "First Author et al., Journal Year" format when identifiable. Example: Gandhi et al., NEJM 2018.

Key attributes to capture:

doi — Digital Object Identifier
pmid — PubMed ID
journal — Journal name
year — Publication year
first_author — First author surname
publication_type — original_research, review, meta_analysis, case_report, editorial

Extraction hints:

Only extract publications discussed substantively, not routine background citations.
Landmark trials are often referred to by study name rather than citation — cross-reference with CLINICAL_TRIAL entities.
Patent documents: capture the patent number (e.g., US10,000,000) as a PUBLICATION.

12. REGULATORY_ACTION

Description: Regulatory decisions, approvals, label changes, and safety communications by health authorities.

Naming convention: Describe the action with the agency. Example: FDA accelerated approval (2017), EMA conditional marketing authorization, FDA complete response letter.

Key attributes to capture:

agency — FDA, EMA, PMDA, NMPA, Health Canada, etc.
action_type — approval, accelerated_approval, breakthrough_designation, priority_review, complete_response_letter, withdrawal, label_update, safety_communication, REMS
date — Date of the action
indication — Specific approved indication
conditions — Post-marketing requirements, REMS, confirmatory trial requirements

Extraction hints:

Regulatory actions are events, not the agencies themselves. "FDA" is an ORGANIZATION; "FDA approval of pembrolizumab" is a REGULATORY_ACTION.
Accelerated approvals often have confirmatory trial requirements — capture these as attributes.
"Breakthrough therapy designation" and "orphan drug designation" are REGULATORY_ACTIONs.

13. PHENOTYPE

Description: Observable biological traits, expression states, and patient characteristics used in stratification or mechanistic description.

Naming convention: Use descriptive terms reflecting the observable state. Example: HER2-positive, microsatellite instability, epithelial-mesenchymal transition.

Key attributes to capture:

phenotype_category — molecular, cellular, physiological, morphological
measurement_method — How the phenotype is assessed
associated_genes — Genes driving the phenotype when known
clinical_relevance — stratification, prognosis, mechanism

Extraction hints:

PHENOTYPE describes a biological state; DISEASE describes a classifiable condition. "Insulin resistance" is a PHENOTYPE; "Type 2 Diabetes Mellitus" is a DISEASE.
PHENOTYPE vs BIOMARKER: Use PHENOTYPE for the biological state itself; use BIOMARKER when it is measured as a clinical indicator for a treatment decision.
Cellular phenotypes like "apoptosis", "senescence", "EMT" are PHENOTYPEs.

Relation Types

Drug-Target Relations

TARGETS

Description: Compound acts on a protein or gene target.
Source types: COMPOUND
Target types: PROTEIN, GENE
Extraction guidance: Look for "targets", "acts on", "directed against", "selective for". The compound is always the source. Example: imatinib TARGETS BCR-ABL kinase.

INHIBITS

Description: Entity inhibits, suppresses, or blocks the activity of another entity.
Source types: COMPOUND, PROTEIN
Target types: PROTEIN, GENE, PATHWAY
Extraction guidance: Look for "inhibits", "blocks", "suppresses", "downregulates", "antagonizes". Covers both direct enzyme inhibition and downstream pathway suppression. Example: vemurafenib INHIBITS BRAF V600E.

ACTIVATES

Description: Entity activates, stimulates, or enhances the activity of another entity.
Source types: COMPOUND, PROTEIN
Target types: PROTEIN, GENE, PATHWAY
Extraction guidance: Look for "activates", "stimulates", "upregulates", "agonizes", "enhances", "induces". Includes receptor agonism and pathway activation. Example: GLP-1 ACTIVATES GLP-1 receptor.

BINDS_TO

Description: Physical binding interaction between molecular entities.
Source types: COMPOUND, PROTEIN
Target types: PROTEIN
Extraction guidance: Look for "binds to", "interacts with", "affinity for", "Kd =", "IC50". Use for physical interactions. If the functional consequence is known (inhibition, activation), prefer INHIBITS or ACTIVATES. Example: trastuzumab BINDS_TO HER2.

HAS_MECHANISM

Description: Compound operates through a specific mechanism of action.
Source types: COMPOUND
Target types: MECHANISM_OF_ACTION
Extraction guidance: Look for "mechanism of action", "works by", "acts as a", "MOA is". Links the drug to its pharmacological mechanism. Example: olaparib HAS_MECHANISM PARP trapping.

Drug-Disease Relations

INDICATED_FOR

Description: Compound is indicated for or used to treat a disease.
Source types: COMPOUND
Target types: DISEASE
Extraction guidance: Look for "indicated for", "approved for", "used to treat", "therapeutic use in", "first-line treatment of". Include both approved and investigational indications — note the distinction in confidence scores. Example: pembrolizumab INDICATED_FOR Non-Small Cell Lung Carcinoma.

CONTRAINDICATED_FOR

Description: Compound is contraindicated for a disease or patient population.
Source types: COMPOUND
Target types: DISEASE
Extraction guidance: Look for "contraindicated in", "should not be used in", "avoid in patients with", "not recommended for". This relation carries patient safety implications — assign with care and prefer higher evidence thresholds. Example: methotrexate CONTRAINDICATED_FOR Hepatic Impairment.

Drug-Clinical Relations

EVALUATED_IN

Description: Compound is evaluated or tested in a clinical trial.
Source types: COMPOUND
Target types: CLINICAL_TRIAL
Extraction guidance: Look for "evaluated in", "tested in", "studied in", "enrolled in", "investigated in". Example: pembrolizumab EVALUATED_IN KEYNOTE-024.

CAUSES

Description: Compound causes or is associated with an adverse event.
Source types: COMPOUND
Target types: ADVERSE_EVENT
Extraction guidance: Look for "causes", "resulted in", "adverse event", "side effect", "toxicity", "led to discontinuation due to". Capture frequency and severity in the ADVERSE_EVENT entity attributes. Example: nivolumab CAUSES Immune-Mediated Colitis.

Drug-Drug Relations

DERIVED_FROM

Description: Compound is derived from, a metabolite of, or structurally based on another compound.
Source types: COMPOUND
Target types: COMPOUND
Extraction guidance: Look for "derived from", "analog of", "metabolite of", "prodrug of", "structurally related to", "next-generation". Covers SAR-derived analogs and metabolic products. Example: osimertinib DERIVED_FROM gefitinib (next-generation EGFR inhibitor).

COMBINED_WITH

Description: Compound is used in combination therapy with another compound.
Source types: COMPOUND
Target types: COMPOUND
Extraction guidance: Look for "combined with", "in combination with", "plus", "co-administered with", "backbone of". Common in oncology (doublets, triplets) and infectious disease (combination ART). For regimen names like "FOLFOX", extract each component drug separately. Example: nivolumab COMBINED_WITH ipilimumab.

Biology Relations

ENCODES

Description: Gene encodes a protein product.
Source types: GENE
Target types: PROTEIN
Extraction guidance: Look for "encodes", "codes for", "gene product", "translated into", "the protein encoded by". Usually straightforward when text discusses both gene and protein. Example: BRAF ENCODES BRAF kinase.

PARTICIPATES_IN

Description: Protein or gene participates in a biological pathway.
Source types: PROTEIN, GENE
Target types: PATHWAY
Extraction guidance: Look for "participates in", "involved in", "member of", "component of", "signals through", "downstream of", "upstream of". Example: KRAS PARTICIPATES_IN MAPK/ERK signaling pathway.

IMPLICATED_IN

Description: Gene, protein, pathway, or phenotype is implicated in or contributes to a disease.
Source types: GENE, PROTEIN, PATHWAY, PHENOTYPE
Target types: DISEASE
Extraction guidance: Look for "implicated in", "contributes to", "driver of", "risk factor for", "pathogenic role in", "oncogenic driver in". Covers causal, correlational, and GWAS associations. Example: BRCA1 IMPLICATED_IN Breast Cancer.

CONFERS_RESISTANCE_TO

Description: Gene, protein, or phenotype confers resistance to a compound.
Source types: GENE, PROTEIN, PHENOTYPE
Target types: COMPOUND
Extraction guidance: Look for "resistance to", "refractory to", "insensitive to", "resistance mutation", "acquired resistance through". Resistance mechanisms are critical for clinical decision-making — be precise. Example: EGFR T790M CONFERS_RESISTANCE_TO gefitinib.

Biomarker Relations

PREDICTS_RESPONSE_TO

Description: Biomarker predicts response to a compound or mechanism of action.
Source types: BIOMARKER
Target types: COMPOUND, MECHANISM_OF_ACTION
Extraction guidance: Look for "predicts response to", "predictive biomarker for", "companion diagnostic", "patients with X benefit from", "enrichment biomarker". This relation has direct clinical implications for treatment selection. Example: PD-L1 expression (TPS >= 50%) PREDICTS_RESPONSE_TO pembrolizumab.

DIAGNOSTIC_FOR

Description: Biomarker is used for diagnosis or prognosis of a disease.
Source types: BIOMARKER
Target types: DISEASE
Extraction guidance: Look for "diagnostic for", "prognostic for", "used to diagnose", "marker of disease activity", "correlates with disease progression". Example: prostate-specific antigen (PSA) DIAGNOSTIC_FOR Prostate Cancer.

Organizational Relations

DEVELOPED_BY

Description: Compound or clinical trial is developed or sponsored by an organization.
Source types: COMPOUND, CLINICAL_TRIAL
Target types: ORGANIZATION
Extraction guidance: Look for "developed by", "manufactured by", "sponsored by", "discovered at", "a product of", "in collaboration with". For co-development, create multiple relations. Example: pembrolizumab DEVELOPED_BY Merck.

PUBLISHED_IN

Description: Clinical trial results or compound data are published in a specific publication.
Source types: CLINICAL_TRIAL, COMPOUND
Target types: PUBLICATION
Extraction guidance: Look for "published in", "reported in", "results presented in", "data published in". Only create this relation for substantive findings, not background citations. Example: KEYNOTE-024 PUBLISHED_IN Gandhi et al., NEJM 2018.

GRANTS_APPROVAL_FOR

Description: Regulatory action grants approval or authorization for a compound.
Source types: REGULATORY_ACTION
Target types: COMPOUND
Extraction guidance: Look for "approved", "granted approval", "authorized", "received marketing authorization", "accelerated approval granted to". The regulatory action is the source, the compound is the target. Example: FDA accelerated approval (2017) GRANTS_APPROVAL_FOR pembrolizumab.

Other Relations

EXPRESSED_IN

Description: Gene or protein is expressed in a phenotypic context or tissue.
Source types: GENE, PROTEIN
Target types: PHENOTYPE
Extraction guidance: Look for "expressed in", "overexpressed in", "detected in", "upregulated in", "high expression in". Includes tissue-specific expression and disease-associated expression patterns. Example: PD-L1 EXPRESSED_IN tumor microenvironment.

ASSOCIATED_WITH

Description: General association between entities. This is the fallback relation.
Source types: Any entity type
Target types: Any entity type
Extraction guidance: Use ONLY when no more specific relation type applies. If you find yourself using ASSOCIATED_WITH frequently, re-evaluate whether a more specific relation type fits. Common legitimate uses: linking a disease to a pathway when IMPLICATED_IN does not fit, connecting entities with an unclear causal direction.

Disambiguation Rules

Disambiguation is critical for producing a clean knowledge graph. Apply these rules in order when an entity could plausibly belong to more than one type.

GENE vs PROTEIN

The same symbol (e.g., EGFR, BRAF, TP53) can refer to either the gene or its protein product. Resolve based on context:

Context clue	Assign as	Example
Mutation, variant, allele, polymorphism, expression level, amplification, deletion, methylation	GENE	"EGFR T790M mutation" → GENE
Binding, inhibition, phosphorylation, enzymatic activity, receptor function, structural domain	PROTEIN	"EGFR receptor autophosphorylation" → PROTEIN
Drug target in pharmacological context	PROTEIN	"erlotinib targets EGFR" → PROTEIN
Genomic alteration in diagnostic context	GENE	"EGFR exon 19 deletion detected by NGS" → GENE
Ambiguous: could be either	Default to GENE for genomic/diagnostic contexts, PROTEIN for pharmacological/functional contexts

COMPOUND vs MECHANISM_OF_ACTION

Example text	Entity type	Reasoning
"pembrolizumab"	COMPOUND	A specific drug molecule
"PD-1 checkpoint blockade"	MECHANISM_OF_ACTION	A pharmacological mechanism class
"anti-PD-1 therapy"	MECHANISM_OF_ACTION	Refers to the class, not a specific drug
"anti-PD-1 antibody pembrolizumab"	Both	Extract COMPOUND (pembrolizumab) AND MOA (PD-1 checkpoint blockade)
"BRAF inhibitor"	MECHANISM_OF_ACTION	Drug class, not a specific compound
"vemurafenib, a BRAF inhibitor"	Both	Extract COMPOUND (vemurafenib) AND MOA (selective BRAF inhibition)

DISEASE vs PHENOTYPE

Example text	Entity type	Reasoning
"Non-Small Cell Lung Carcinoma"	DISEASE	Classifiable diagnosis with ICD code
"microsatellite instability"	PHENOTYPE	Observable biological state
"microsatellite instability-high (MSI-H)" used for treatment selection	BIOMARKER	Used as measurable clinical indicator
"insulin resistance"	PHENOTYPE	Observable physiological state
"Type 2 Diabetes Mellitus"	DISEASE	Classifiable condition with diagnostic criteria
"HER2-positive"	PHENOTYPE	Molecular subtype / observable state
"HER2-positive Breast Cancer"	DISEASE	Classifiable disease subtype
"epithelial-mesenchymal transition"	PHENOTYPE	Cellular biological process

ADVERSE_EVENT vs DISEASE

Example text	Entity type	Reasoning
"drug-induced hepatotoxicity"	ADVERSE_EVENT	Caused by a drug
"hepatitis B"	DISEASE	An independent medical condition
"immune-mediated colitis following nivolumab"	ADVERSE_EVENT	Caused by treatment
"ulcerative colitis"	DISEASE	Independent condition
"treatment-emergent hypertension"	ADVERSE_EVENT	Arose from treatment
"essential hypertension"	DISEASE	Pre-existing condition
"tumor lysis syndrome"	ADVERSE_EVENT	Treatment-triggered condition
"renal cell carcinoma"	DISEASE	The malignancy being treated

BIOMARKER vs GENE/PROTEIN

Example text	Entity type	Reasoning
"PD-L1 is a transmembrane protein"	PROTEIN	Discussing biology
"PD-L1 TPS >= 50% predicts pembrolizumab response"	BIOMARKER	Used as clinical decision indicator
"BRCA1 is a tumor suppressor"	GENE	Discussing gene function
"BRCA1 mutation status guides PARP inhibitor selection"	BIOMARKER	Used as treatment selection marker
"TMB-high (>= 10 mut/Mb)"	BIOMARKER	Quantified indicator with threshold

Confidence Calibration

Assign confidence scores based on the strength and directness of textual evidence. These scores should reflect how certain you are that the extraction is correct AND supported by the text.

0.9 - 1.0: Explicit primary findings

Assign this range when:

The entity or relation is directly and unambiguously stated in the text
The finding is a primary result of the study with statistical evidence
Quantitative data supports the claim (hazard ratio, p-value, odds ratio, IC50)
The text uses definitive language: "we demonstrate", "our results show", "was approved for"

Examples:

"Pembrolizumab significantly improved OS vs chemotherapy (HR 0.63, p<0.001)" — INDICATED_FOR at 0.95
"FDA approved pembrolizumab for first-line NSCLC" — GRANTS_APPROVAL_FOR at 0.98
"IC50 of compound X against kinase Y was 3.2 nM" — INHIBITS at 0.95

0.7 - 0.89: Clear secondary findings

Assign this range when:

The finding is clearly stated but is a secondary observation or result
No statistical evidence is provided, but the statement is definitive
The evidence comes from a well-established body of prior work referenced in passing
The text uses confident but non-quantitative language: "is known to", "has been shown to", "consistently demonstrates"

Examples:

"Imatinib, a well-known BCR-ABL inhibitor" — INHIBITS at 0.85 (established fact stated without primary data here)
"Patients with BRCA mutations may benefit from PARP inhibitors" — PREDICTS_RESPONSE_TO at 0.75
"This pathway is involved in tumor progression" — IMPLICATED_IN at 0.80

0.5 - 0.69: Indirect evidence or preclinical data

Assign this range when:

The finding is inferred from indirect evidence
Data is from preclinical (in vitro or animal) studies only
The text uses hedging language: "suggests", "may", "appears to", "could potentially"
The association is observed but causality is not established
The finding is from a small study or preliminary data

Examples:

"In vitro data suggest compound X may inhibit target Y" — INHIBITS at 0.60
"Mouse models indicate potential efficacy in disease Z" — INDICATED_FOR at 0.55
"Preliminary results suggest a possible association" — ASSOCIATED_WITH at 0.55

Below 0.5: Speculative or weakly supported

Assign this range when:

The claim is speculative or hypothetical
Evidence comes from a single case report or anecdotal observation
The text explicitly notes uncertainty: "it is unclear whether", "remains to be determined"
The association is based on co-occurrence only without mechanistic support
Prophetic examples from patent documents

Examples:

"It is tempting to speculate that compound X might target pathway Y" — TARGETS at 0.35
"A single case report described response to..." — INDICATED_FOR at 0.40
Patent prophetic example: "Compound X is expected to treat disease Y" — INDICATED_FOR at 0.30

Molecular Identifier Detection

Rules for molecular identifiers

Quote the EXACT surrounding text as the context field. Include enough context to identify what the identifier refers to, but quote it verbatim.
Do NOT reproduce the identifier yourself. Molecular notation is brittle — a single transposed character changes the molecule. Instead, note its presence in the entity attributes.

Flag for validation by including in the entity attributes:

{
  "requires_validation": true,
  "notation_type": "SMILES"
}

Supported notation types to detect and flag:
- SMILES — e.g., CC(=O)Oc1ccccc1C(=O)O
- InChI — e.g., InChI=1S/C9H8O4/...
- InChIKey — e.g., BSYNRYMUTXBXSQ-UHFFFAOYSA-N
- amino_acid_sequence — e.g., MVLSPADKTNVKAAWGKVGAH...
- nucleotide_sequence — e.g., ATCGATCGATCG...
- CAS_number — e.g., 50-78-2
- IUPAC_name — long systematic chemical names

When an identifier appears alongside a common name, use the common name as the entity name and note the identifier type:

{
  "name": "aspirin",
  "entity_type": "COMPOUND",
  "attributes": {
    "requires_validation": true,
    "notation_type": "SMILES",
    "notation_present": true
  },
  "context": "Aspirin (CC(=O)Oc1ccccc1C(=O)O) was used as a positive control"
}

Document Type Strategies

Different document types require different extraction approaches. Identify the document type first, then apply the appropriate strategy.

Research Articles (PubMed / bioRxiv)

Priority sections: Abstract, Results, Discussion, Methods (for assay details).

Strategy:

Extract primary findings from the Results section with statistical evidence (p-values, confidence intervals, hazard ratios, effect sizes).
Note whether findings are in vitro, in vivo (animal model), or in human subjects — record this in the entity attributes as evidence_level.
Distinguish between the authors' own data (higher confidence) and referenced prior work (lower confidence unless corroborated).
For multi-panel figures referenced in text, extract the finding described, not the figure itself.
Review articles cited in Discussion for established facts vs. novel claims.

Special considerations:

Preprints (bioRxiv, medRxiv) have not undergone peer review — reduce confidence by ~0.1 compared to peer-reviewed data.
Supplementary materials often contain critical data (Western blots, dose-response curves) referenced from the main text.

Clinical Trial Reports

Priority sections: Study design, Patient population, Primary endpoints, Key secondary endpoints, Safety.

Strategy:

Extract the trial design as CLINICAL_TRIAL entity attributes: phase, randomization, blinding, patient population.
Extract primary endpoint results with exact statistics: HR, ORR, CR rate, DOR, PFS, OS with confidence intervals and p-values.
For safety data, extract Grade 3+ adverse events and any leading to discontinuation.
Capture subgroup analyses when they reveal differential benefit (e.g., PD-L1 high vs. low).
Note if the trial met its primary endpoint (attribute: primary_endpoint_met: true/false).

Special considerations:

Interim analyses carry lower confidence than final analyses.
Post-hoc analyses should receive lower confidence (0.5-0.7 range).
Intention-to-treat vs. per-protocol populations may show different results — note which is reported.

Patent Documents

Priority sections: Examples (experimental data), Claims (scope), Description (mechanism and rationale).

Strategy:

Extract specific compounds from Examples with actual experimental data at higher confidence.
Markush structures describe compound families, not individual molecules — extract the family as a single COMPOUND entity with compound_class attribute, not individual members.
Prophetic examples (described but not actually performed) receive significantly lower confidence (0.3-0.5). Look for hedging language: "is expected to", "would be", "may be prepared by".
Claims define legal scope, not scientific findings — extract claimed indications at 0.4-0.6 confidence.
Distinguish between actually synthesized compounds (higher confidence) and those merely described (lower confidence).

Special considerations:

Patent language is deliberately broad — do not extract every Markush permutation.
Priority dates and filing dates are attributes of the PUBLICATION entity.
Patent families (same invention, different jurisdictions) should be treated as one PUBLICATION.

Review Articles

Priority sections: Body text summarizing consensus, Tables summarizing clinical data, Figures showing pathways.

Strategy:

Extract established consensus findings at moderate-to-high confidence (0.7-0.9), as they represent the field's understanding, not primary data.
Identify and note conflicting findings: when the review presents opposing viewpoints, extract both with appropriate confidence and note the disagreement in context quotes.
Summary tables often contain concentrated extractable data (drug-target-indication triples).
Pathway diagrams described in text are good sources for PARTICIPATES_IN and IMPLICATED_IN relations.

Special considerations:

Systematic reviews and meta-analyses provide stronger evidence than narrative reviews.
Expert opinion sections receive lower confidence than data-driven sections.
Do not extract the review's speculative conclusions at the same confidence as its evidence summary.

Regulatory Documents (FDA Labels, EMA Summaries, Advisory Committee Briefings)

Priority sections: Indications and Usage, Dosage, Warnings and Precautions, Adverse Reactions, Clinical Studies.

Strategy:

Approved indications are high-confidence INDICATED_FOR relations (0.95+).
Black box warnings are high-confidence ADVERSE_EVENT entities and CAUSES relations.
Contraindications are high-confidence CONTRAINDICATED_FOR relations.
Clinical pharmacology sections contain mechanism of action data.
Post-marketing safety data may include newly identified risks.

Special considerations:

Label language has legal and regulatory precision — extract exactly what it states.
"Limitation of use" sections describe what the drug is NOT approved for.
Risk Evaluation and Mitigation Strategies (REMS) indicate serious safety concerns.
Pharmacogenomic information in labels should be extracted as BIOMARKER entities.

Extraction Rules

These rules govern the overall extraction process. Apply them consistently across all document types.

Extract ALL entities matching the 13 defined types. Do not skip entities because they seem minor. A thorough knowledge graph is more valuable than a selective one.
Use entity names for source_entity and target_entity, not IDs or internal references. The name must exactly match the name field of an entity in the entities array.
Only extract explicitly stated relationships. A relation must be supported by the text. If two entities are mentioned in the same sentence but no relationship is stated between them, do not create a relation based on co-occurrence alone.
Do not infer from co-occurrence. "Drug X was administered. Patients experienced nausea." does not establish CAUSES unless the text explicitly attributes the nausea to Drug X (e.g., "Drug X-related nausea was reported in 15% of patients").
Keep context and evidence quotes in the original language of the source text. Do not translate or paraphrase. These quotes serve as provenance and must be verifiable against the source.
Output entity names in English using standard nomenclature, even when the source text is in another language. The context field preserves the original language; the name field uses the canonical English term.
For combination drugs, extract each component as a separate COMPOUND entity and link them with COMBINED_WITH. "FOLFOX" becomes: leucovorin (COMPOUND), fluorouracil (COMPOUND), oxaliplatin (COMPOUND), each linked with COMBINED_WITH.
Prefer specific relation types over ASSOCIATED_WITH. ASSOCIATED_WITH is the fallback for associations that do not fit any other relation type. If you are using ASSOCIATED_WITH more than 10% of the time, reconsider your relation type assignments.
Avoid duplicate entities. If the same real-world entity appears under different names (e.g., "Gleevec" and "imatinib"), create a single entity using the canonical name with aliases as attributes. Do not create two entities for the same molecule.
For negated statements, do not extract affirmative relations. "Drug X did not show efficacy in disease Y" should NOT produce an INDICATED_FOR relation. You may note the negative finding in a lower-confidence ASSOCIATED_WITH with context making the negation clear, or omit it entirely.
Dosage and administration details are attributes of the COMPOUND entity, not separate entities. "Pembrolizumab 200 mg IV Q3W" produces one COMPOUND entity with dosage attributes.
Temporal information (dates, durations, timelines) should be captured as attributes, not entities. "12-week treatment period" is an attribute of the CLINICAL_TRIAL, not a separate entity.
Statistical values (p-values, hazard ratios, confidence intervals) should be captured as attributes of the relation or the relevant entity. Include them in the evidence quote and as structured attributes when possible.
When in doubt about entity type, check the disambiguation rules above. When in doubt about whether to extract, err on the side of extraction with an appropriate confidence score.