From sciagent-skills
Uses HuggingFace Transformers with biomedical models (BioBERT, PubMedBERT, BioGPT, BioMedLM) for scientific NLP tasks: NER (genes, diseases, chemicals), relation extraction, QA, classification, summarization, fine-tuning.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

This skill uses the workspace's default tool permissions.
HuggingFace Transformers provides a unified API to load, run, and fine-tune 500+ biomedical language models. The key biomedical models — BioBERT (trained on PubMed abstracts + PMC full text), PubMedBERT (trained from scratch on PubMed), BioGPT (generative, trained on PubMed), and BioMedLM — significantly outperform general-purpose BERT on biomedical NER, relation extraction, and question answering. The pipeline() abstraction handles tokenization, inference, and postprocessing in one call. Fine-tuning on task-specific labeled data (e.g., BC5CDR for chemical/disease NER) takes under an hour on a single GPU. The datasets library provides direct access to standard biomedical benchmarks.
Dependencies: transformers, torch, datasets, accelerate, sentencepiece

pip install transformers torch datasets accelerate sentencepiece
# For GPU (CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
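The overview above mentions that the datasets library exposes standard biomedical benchmarks directly. A minimal sketch, assuming the ncbi_disease corpus is published on the HuggingFace Hub under that ID with tokens/ner_tags columns (substitute whichever benchmark you need, e.g. a BC5CDR variant, and check its dataset card for the exact schema):
from datasets import load_dataset
# Quick check: load a biomedical NER benchmark from the Hub.
# The "ncbi_disease" ID and its tokens/ner_tags columns are assumptions here.
ds = load_dataset("ncbi_disease", split="train")
print(ds)                     # column names and row count
print(ds[0]["tokens"][:10])   # whitespace-split tokens of the first example
print(ds[0]["ner_tags"][:10]) # integer BIO tag ids aligned to those tokens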
from transformers import pipeline
# Named entity recognition with a pre-trained biomedical NER model
# (a base encoder without a token-classification head would not work here)
ner = pipeline("ner", model="d4data/biomedical-ner-all",
aggregation_strategy="simple")
text = "BRCA1 mutations are associated with increased risk of breast cancer and ovarian cancer."
entities = ner(text)
for ent in entities:
    print(f" {ent['word']:20s} {ent['entity_group']:10s} score={ent['score']:.3f}")
Extract biomedical entities using pre-trained NER models.
from transformers import pipeline
# Pre-trained biomedical NER models (genes, diseases, chemicals)
# Common choices:
#   "d4data/biomedical-ner-all" — multi-entity biomedical NER
#   "pruas/BENT-PubMedBERT-NER-Gene" — gene-specific NER
# Base encoders such as "allenai/scibert_scivocab_cased" have no NER head and
# must be fine-tuned before they can be used in a "ner" pipeline.
ner_pipe = pipeline(
"ner",
model="d4data/biomedical-ner-all",
aggregation_strategy="simple", # merge subword tokens into words
device=-1 # -1=CPU, 0=GPU
)
abstracts = [
"Imatinib inhibits the BCR-ABL1 tyrosine kinase and is first-line treatment for CML.",
"EGFR mutations in non-small cell lung cancer predict response to erlotinib.",
]
for text in abstracts:
    entities = ner_pipe(text)
    print(f"\nText: {text[:60]}...")
    for e in entities:
        print(f"  [{e['entity_group']}] '{e['word']}' (score={e['score']:.2f})")
# Manual tokenization + inference for batch processing
# Note: use a checkpoint that already has a fine-tuned token-classification head;
# a base encoder such as allenai/scibert_scivocab_cased would load with a
# randomly initialized NER head and meaningless labels.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_name = "d4data/biomedical-ner-all"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()
text = "Metformin activates AMPK and reduces hepatic glucose production in type 2 diabetes."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits # shape: (1, seq_len, n_labels)
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]
for token, label in zip(tokens[1:-1], labels[1:-1]):  # skip [CLS] and [SEP]
    if label != "O":
        print(f"  {token:20s} {label}")
Classify biomedical abstracts or sentences.
from transformers import pipeline
# Zero-shot classification — no fine-tuning needed
zs_clf = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli",
device=-1)
abstract = """
This randomized controlled trial evaluated the efficacy of pembrolizumab versus
chemotherapy in patients with advanced non-small-cell lung cancer. Overall survival
was significantly improved in the pembrolizumab arm (HR=0.60, 95% CI 0.41-0.89).
"""
candidate_labels = ["clinical trial", "basic research", "meta-analysis", "review"]
result = zs_clf(abstract, candidate_labels)
print("Zero-shot classification:")
for label, score in zip(result["labels"], result["scores"]):
    print(f"  {label:20s}: {score:.3f}")
# Fine-tuned sentiment/outcome classification
from transformers import pipeline
# Example: a general-purpose sentence-sentiment model as a stand-in; for real
# outcome classification, substitute a checkpoint fine-tuned on clinical
# outcome labels (a gene-NER model, as originally listed here, would not work
# in a text-classification pipeline).
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               device=-1)
sentences = [
"Treatment significantly improved overall survival (p<0.001).",
"No statistically significant difference was observed between groups.",
]
results = clf(sentences)
for sent, result in zip(sentences, results):
    print(f"  [{result['label']} | {result['score']:.2f}] {sent[:50]}...")
Extract answers from biomedical text passages.
from transformers import pipeline
# Extractive QA: find answer span within context
qa_pipe = pipeline(
"question-answering",
model="sultan/BioM-ELECTRA-Large-SQuAD2", # biomedical QA model
device=-1
)
context = """
BRCA1 is a tumor suppressor gene located on chromosome 17q21. Pathogenic variants
in BRCA1 confer a lifetime breast cancer risk of 50-72% and ovarian cancer risk
of 44-46%. BRCA1 protein functions in DNA double-strand break repair via
homologous recombination.
"""
questions = [
"What chromosome is BRCA1 located on?",
"What is the lifetime breast cancer risk from BRCA1 variants?",
"What DNA repair pathway does BRCA1 participate in?",
]
for q in questions:
    result = qa_pipe(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score={result['score']:.3f})\n")
Generate biomedical text, hypotheses, and summaries.
from transformers import AutoTokenizer, BioGptForCausalLM
import torch
model_name = "microsoft/biogpt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BioGptForCausalLM.from_pretrained(model_name)
model.eval()
prompt = "The role of VEGF in tumor angiogenesis"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated:\n{generated}")
Embed biomedical text for similarity search and clustering.
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
def mean_pooling(model_output, attention_mask):
    """Mean pooling across token embeddings."""
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1)
# PubMedBERT for biomedical sentence embeddings
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
sentences = [
"BRCA1 is involved in DNA double-strand break repair.",
"Homologous recombination requires BRCA1 and BRCA2.",
"Metformin inhibits hepatic gluconeogenesis via AMPK.",
]
inputs = tokenizer(sentences, padding=True, truncation=True,
max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs, inputs["attention_mask"])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1).numpy()
# Compute cosine similarity
from numpy.linalg import norm
sim_01 = np.dot(embeddings[0], embeddings[1])
sim_02 = np.dot(embeddings[0], embeddings[2])
print(f"Similarity (BRCA1 repair vs. HR): {sim_01:.3f}")
print(f"Similarity (BRCA1 repair vs. Metformin): {sim_02:.3f}")
Fine-tune a biomedical model on a labeled NER dataset.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
TrainingArguments, Trainer, DataCollatorForTokenClassification)
from datasets import Dataset
import numpy as np
# Example: minimal NER fine-tuning setup
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract"
label_list = ["O", "B-GENE", "I-GENE", "B-DISEASE", "I-DISEASE"]
id2label = {i: l for i, l in enumerate(label_list)}
label2id = {l: i for i, l in enumerate(label_list)}
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
model_name, num_labels=len(label_list), id2label=id2label, label2id=label2id
)
# Training arguments
training_args = TrainingArguments(
output_dir="./biomed_ner_finetuned",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
warmup_steps=100,
weight_decay=0.01,
logging_dir="./logs",
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
print(f"Model ready for fine-tuning: {model_name}")
print(f"Labels: {label_list}")
# trainer = Trainer(model=model, args=training_args, ...)
# trainer.train()
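One way to complete the commented stub above: a data collator for dynamic padding plus the Trainer wiring. The train_ds and eval_ds names below are placeholders for datasets.Dataset objects whose rows already contain input_ids, attention_mask, and subword-aligned labels (the alignment step is sketched after the BIO-tagging note below).
# Data collator pads inputs and labels dynamically per batch.
data_collator = DataCollatorForTokenClassification(tokenizer)
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_ds,      # placeholder: tokenized + label-aligned Dataset
#     eval_dataset=eval_ds,        # placeholder: held-out split, same preprocessing
#     data_collator=data_collator,
#     tokenizer=tokenizer,
# )
# trainer.train()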
Biomedical text contains domain-specific strings (gene symbols, drug names, chemical SMILES, numeric values) that WordPiece and BPE tokenizers split unexpectedly. For example, "BRCA1" → ["BR", "##CA", "##1"]. This subword splitting does not affect classification tasks, but it does affect NER — use aggregation_strategy="simple" or "first" in pipeline() to merge subword predictions back to word level.
NER uses BIO (Begin-Inside-Outside) tagging: B-GENE marks the first token of a gene name, I-GENE marks continuation tokens, O marks non-entity tokens. During fine-tuning, align labels to subword tokens by setting non-first subword labels to -100 (ignored by the loss function).
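A minimal sketch of that alignment, reusing the tokenizer, label2id, and id2label from the fine-tuning block above (assumes a fast tokenizer, which exposes word_ids()):
# Align word-level BIO labels to subword tokens; -100 positions are ignored by the loss.
words = ["BRCA1", "mutations", "cause", "breast", "cancer"]
word_labels = ["B-GENE", "O", "O", "B-DISEASE", "I-DISEASE"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
aligned, previous_word = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:                 # [CLS] / [SEP] special tokens
        aligned.append(-100)
    elif word_id != previous_word:      # first subword keeps the word's label
        aligned.append(label2id[word_labels[word_id]])
    else:                               # later subwords are masked out
        aligned.append(-100)
    previous_word = word_id
for tok, lab in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), aligned):
    print(f"  {tok:12s} {'-100' if lab == -100 else id2label[lab]}")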
from transformers import pipeline
import pandas as pd
ner_pipe = pipeline("ner", model="d4data/biomedical-ner-all",
aggregation_strategy="simple", device=-1)
abstracts = [
"Pembrolizumab combined with chemotherapy significantly improved progression-free survival in HER2-positive breast cancer.",
"Inhibition of EGFR by gefitinib is effective in patients with activating EGFR mutations in exons 19 and 21.",
"CRISPR-Cas9 editing of the PCSK9 gene in hepatocytes reduces LDL cholesterol in murine models.",
]
records = []
for i, text in enumerate(abstracts):
    entities = ner_pipe(text)
    for e in entities:
        records.append({
            "abstract_id": i,
            "entity": e["word"],
            "type": e["entity_group"],
            "score": round(e["score"], 3),
        })
df = pd.DataFrame(records)
print(df.groupby("type")["entity"].apply(list).to_string())
df.to_csv("extracted_entities.csv", index=False)
print(f"\nExtracted {len(df)} entity mentions across {len(abstracts)} abstracts")
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    vecs = out.last_hidden_state[:, 0, :]  # [CLS] token
    return torch.nn.functional.normalize(vecs, dim=1).numpy()
query = "CRISPR base editing for correction of point mutations in genetic disease"
corpus = [
"Base editing enables precise single-base changes in genomic DNA without double-strand breaks.",
"CAR-T cell therapy targets CD19 in B-cell acute lymphoblastic leukemia.",
"Prime editing uses reverse transcriptase to install targeted edits at specific loci.",
"RNA interference silences gene expression via RISC-mediated mRNA cleavage.",
]
q_emb = embed([query])
c_emb = embed(corpus)
scores = (q_emb @ c_emb.T).flatten()
ranked = sorted(zip(scores, corpus), reverse=True)
print("Top results:")
for score, text in ranked:
    print(f"  [{score:.3f}] {text[:70]}...")
| Parameter | Module/Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| model | pipeline() | — | HuggingFace model ID string | Pre-trained model to load; must match task |
| aggregation_strategy | NER pipeline | "none" | "none", "simple", "first", "average" | Merge subword NER predictions; use "simple" for word-level output |
| device | pipeline() | -1 | -1 (CPU), 0 (GPU 0), 1 (GPU 1) | Inference device |
| max_length | tokenizer | 512 | 128–2048 (model-dependent) | Max token length; truncates longer inputs |
| max_new_tokens | model.generate() | 20 | 1–1000 | Tokens to generate for text generation models |
| num_beams | model.generate() | 1 | 1–10 | Beam search width; larger = better quality, slower |
| num_train_epochs | TrainingArguments | 3 | 1–10 | Fine-tuning epochs |
| per_device_train_batch_size | TrainingArguments | 8 | 4–32 | Batch size per GPU; reduce if OOM |
| weight_decay | TrainingArguments | 0.0 | 0.01–0.1 | L2 regularization for fine-tuning |
Use domain-specific models, not general BERT: PubMedBERT trained from scratch on PubMed outperforms BERT-base by 5–15% on biomedical NER. Always start with biomedical pre-training before fine-tuning on task-specific data.
Verify model licenses before production use: Some models (BioGPT, BioMedLM) have research-only licenses. Check the HuggingFace model card's license field before deploying in commercial applications.
Use aggregation_strategy="simple" for word-level NER output: The default "none" returns raw subword tokens, making post-processing difficult. "simple" groups consecutive subword predictions into word-level entity spans; "first" and "average" additionally control how conflicting subword labels within a word are resolved.
Truncate at sentence boundaries, not mid-sentence: Long biomedical abstracts that exceed 512 tokens should be split at sentence boundaries before encoding. Mid-sentence truncation degrades NER accuracy for entities near the cutoff.
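A minimal sketch of that splitting step, using a naive regex splitter (a dedicated sentence segmenter such as scispaCy handles abbreviations better) and the same NER pipeline shown earlier:
import re
from transformers import pipeline
ner_pipe = pipeline("ner", model="d4data/biomedical-ner-all",
                    aggregation_strategy="simple", device=-1)
long_abstract = (
    "BRCA1 mutations increase breast cancer risk. "
    "Olaparib is a PARP inhibitor used in BRCA-mutated tumors. "
    "Homologous recombination deficiency predicts response to platinum agents."
)
# Naive splitter: break on sentence-ending punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", long_abstract.strip())
for sent in sentences:
    for e in ner_pipe(sent):
        print(f"  [{e['entity_group']}] {e['word']} ({e['score']:.2f})")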
from transformers import pipeline
from itertools import product
ner = pipeline("ner", model="d4data/biomedical-ner-all",
aggregation_strategy="simple", device=-1)
def extract_drug_disease_pairs(text):
    entities = ner(text)
    drugs = [e["word"] for e in entities if e["entity_group"] in ("DRUG", "CHEMICAL")]
    diseases = [e["word"] for e in entities if e["entity_group"] in ("DISEASE", "CONDITION")]
    return list(product(drugs, diseases))
text = "Imatinib and nilotinib both target BCR-ABL1 in chronic myeloid leukemia and Philadelphia chromosome-positive ALL."
pairs = extract_drug_disease_pairs(text)
print("Drug-Disease pairs:")
for drug, disease in pairs:
    print(f"  {drug} → {disease}")
from transformers import pipeline
clf = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli", device=-1)
abstracts = [
"We present a phase 3 randomized controlled trial of semaglutide in type 2 diabetes.",
"Structural analysis of the SARS-CoV-2 spike protein RBD domain by cryo-EM.",
"A retrospective cohort study of 1,200 ICU patients during the COVID-19 pandemic.",
]
label_options = ["randomized controlled trial", "observational study", "structural biology", "computational study"]
for abstract in abstracts:
    result = clf(abstract, label_options)
    print(f"Type: {result['labels'][0]} ({result['scores'][0]:.2f})")
    print(f"  {abstract[:70]}...\n")
| Problem | Cause | Solution |
|---|---|---|
| CUDA out of memory during inference | Batch too large for GPU VRAM | Reduce batch size; use device=-1 for CPU; use model.half() for FP16 |
| NER returns subword tokens (##CA) | aggregation_strategy not set | Set aggregation_strategy="simple" in pipeline() |
| Model download times out | Large model files (1–10 GB); slow connection | Download manually with huggingface-cli download, then set HF_HUB_OFFLINE=1 |
| NER misses entities at end of long abstracts | Input truncated at 512 tokens | Split abstracts into sentences; process each separately |
| Fine-tuning loss is NaN | Learning rate too high or gradient explosion | Reduce learning_rate to 2e-5; enable gradient clipping max_grad_norm=1.0 |
| Wrong entities for specialized domain | Generic biomedical model not suited to subdomain | Fine-tune on domain-labeled data; use a more specific model (e.g., gene-only NER) |
| BioGPT generates repetitive text | no_repeat_ngram_size too small | Set no_repeat_ngram_size=3 or 4; increase num_beams |
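For the out-of-memory row above, a minimal sketch of FP16 inference that falls back to CPU/FP32 when no GPU is present (the model name follows the embedding examples earlier):
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
if torch.cuda.is_available():
    model = model.half().to("cuda")   # FP16 weights roughly halve GPU memory
model.eval()
device = next(model.parameters()).device
enc = tokenizer("BRCA1 functions in DNA double-strand break repair.",
                return_tensors="pt").to(device)
with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape, out.last_hidden_state.dtype)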
pubmed-database — retrieve PubMed abstracts that serve as input to biomedical NLP pipelines
biorxiv-database — retrieve preprints for NLP analysis before peer review
scientific-critical-thinking — evaluate quality of NLP-extracted evidence before using for research conclusions