Extracts implementation-focused notes from scientific paper PDFs by converting pages to images and reading them. Handles papers of 50 pages or fewer directly and chunks larger papers. Outputs structured notes to the papers/ directory.
```bash
npx claudepluginhub ctoth/research-papers-plugin --plugin research-papers
```
This skill uses the workspace's default tool permissions.
Read a scientific paper and create comprehensive implementation-focused notes.
This skill is a checklist, not an outcome sketch.
Do not claim you are blocked unless you actually attempted to read a page image locally (for example, Read Image in Claude Code or view_image in Codex) to inspect pngs/page-*.png, starting with page-000.png, and the platform refused or failed. The target output in this repo is a dense paper surrogate, not a sharpened executive summary.
Paper extraction is high-stakes and context-heavy. If you dispatch any subagent for reading, chunk extraction, synthesis, abstract extraction, citations extraction, or end-to-end paper processing, use the strongest available full-size model; never a mini/small tier.
The command examples below use scripts/... paths that are relative to this skill's directory. Resolve them against the installed skill location, not the user's project root.
If the argument is a directory, use it directly. If it's a PDF file, use paper_hash.py lookup to find a matching paper directory:
```bash
paper_path="$ARGUMENTS"
if [ -d "$paper_path" ]; then
  paper_dir="$paper_path"
elif [ -f "$paper_path" ]; then
  HASH_SCRIPT="scripts/paper_hash.py"
  if [ -f "$HASH_SCRIPT" ]; then
    paper_dir=$(python3 "$HASH_SCRIPT" --papers-dir papers/ lookup "$(basename "$paper_path" .pdf)" 2>/dev/null)
    [ $? -ne 0 ] && paper_dir=""
    [ -n "$paper_dir" ] && paper_dir="papers/$paper_dir"
  else
    basename=$(basename "$paper_path" .pdf)
    paper_dir=$(ls -d papers/*/ 2>/dev/null | grep -i "${basename%_*}" | head -1)
  fi
fi
```
Continue to Step 1.
```bash
ls -la "$paper_dir"/*.md 2>/dev/null
ls "$paper_dir"/pngs/ 2>/dev/null | head -3
ls "$paper_dir"/*.pdf 2>/dev/null | head -1
```
- No notes.md? Incomplete — continue to Step 1.
- No notes.md, but paper.pdf and pngs/page-000.png already exist? This is a rerun/regeneration case. Do not rename or move files. Reuse the existing paper directory, inspect the existing page images directly, and continue to Step 1 with the existing assets.
- No notes.md, paper.pdf exists, but pngs/ is missing or incomplete? Regenerate pngs/ from the existing paper.pdf, then continue.
- notes.md complete? If the source PDF sits outside the paper directory, delete it (rm "$paper_path"). NEVER delete a PDF that lives inside its own paper directory (e.g., papers/Author_Year/paper.pdf). Report "Already complete," stop.

First determine the working PDF path:
```bash
paper_path="$ARGUMENTS"
if [ -d "$paper_path" ]; then
  paper_dir="$paper_path"
  work_pdf="$paper_dir/paper.pdf"
else
  work_pdf="$paper_path"
fi
```
If you are in a rerun/regeneration case and "$paper_dir"/pngs/page-000.png already exists, reuse the existing page images. Do not reconvert just because notes.md is missing.
Get page count:
```bash
pdfinfo "$work_pdf" 2>/dev/null | grep Pages || echo "pdfinfo not available"
```
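The lookup above can be wrapped in a small helper so later steps get a bare number instead of grepping output. This is a sketch, not one of this skill's scripts: `page_count` is a hypothetical name, and it assumes poppler's pdfinfo is installed.

```bash
# Hypothetical helper (not part of this skill's scripts): print only the
# numeric page count from poppler's pdfinfo, or nothing if pdfinfo fails.
page_count() {
  pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}'
}
```

Usage: `total=$(page_count "$work_pdf")`.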
If pngs/page-000.png does not already exist, extract page 0 first in a temp dir for metadata:
```bash
tmpdir=$(mktemp -d)
magick -density 150 "$work_pdf[0]" -quality 90 -resize '1960x1960>' "$tmpdir/page0.png"
```
Read either the existing pngs/page-000.png or $tmpdir/page0.png to extract author, year, and title. Determine directory name: LastName_Year_2-4WordTitle (e.g., Mack_2021_AccessibilityResearchSurvey).
For a new paper, set:
```bash
paper_dir="./papers/Author_Year_ShortTitle"
```
If this is a new paper, create the output directory and convert all pages:
```bash
mkdir -p "$paper_dir/pngs"
# Move source PDF into paper directory — skip if already there
if [ "$(realpath "$work_pdf")" != "$(realpath "$paper_dir/paper.pdf")" ]; then
  mv "$work_pdf" "$paper_dir/paper.pdf"  # MUST be mv, NEVER cp
fi
magick -density 150 "$paper_dir/paper.pdf" -quality 90 -resize '1960x1960>' "$paper_dir/pngs/page-%03d.png"
rm -rf "$tmpdir"
```
If this is an existing paper directory with paper.pdf present but missing/incomplete pngs/, regenerate:
```bash
mkdir -p "$paper_dir/pngs"
magick -density 150 "$paper_dir/paper.pdf" -quality 90 -resize '1960x1960>' "$paper_dir/pngs/page-%03d.png"
rm -rf "$tmpdir"
```
CRITICAL: Use mv, NEVER cp. The root-level PDF must be removed. A PDF left in papers/ root is indistinguishable from an unprocessed paper.
CRITICAL: Never write temp files to papers/ root. Use mktemp -d for temp work.
Count pages:
```bash
ls "$paper_dir"/pngs/page-*.png | wc -l
```
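Before making the chunking decision, the PNG count can be sanity-checked against the page count the PDF itself reports. This is a sketch, not part of the skill's scripts: `check_page_count` is a hypothetical name, and it assumes poppler's pdfinfo is available.

```bash
# Hypothetical helper (not part of this skill's scripts): warn and fail if
# the number of rendered PNGs disagrees with the PDF's reported page count.
check_page_count() {
  local dir="$1"
  local expected actual
  expected=$(pdfinfo "$dir/paper.pdf" 2>/dev/null | awk '/^Pages:/ {print $2}')
  actual=$(ls "$dir"/pngs/page-*.png 2>/dev/null | wc -l | tr -d ' ')
  if [ -n "$expected" ] && [ "$actual" -ne "$expected" ]; then
    echo "WARNING: $actual PNGs but PDF reports $expected pages"
    return 1
  fi
  return 0
}
```

If the counts disagree, reconvert before reading rather than producing notes with silent gaps.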
Decision:
Before long extraction, inspect page-000.png from the paper's pngs/ directory using the platform's local image-reading capability (for example, Read Image in Claude Code or view_image in Codex).
Stop only if you genuinely attempted to read page-000.png and the platform prevented it. Once page-000.png is visible, continue immediately to Step 2A or Step 2B.
CRITICAL: Read EVERY page image. No skipping, no sampling, no "reading enough to get the gist." Read every single page-NNN.png file from page-000 through the last page. If you have 34 pages, read 34 page images. Agents routinely skip pages to save tokens — this produces incomplete notes that miss equations, parameters, and key details buried in middle sections. The entire point of reading the paper is completeness. If you skip pages, the notes are worthless.
For papers with 50 pages or fewer, the assigned worker must do this reading itself. Do not dispatch additional readers for a small paper.
Take thorough notes as you go. Continue to Step 3.
Split into 50-page chunks. Calculate ranges:
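The range calculation can be sketched as a small shell loop; `chunk_ranges` is a hypothetical helper name, not one of the skill's scripts, and pages are 0-based to match the page-NNN.png naming:

```bash
# Hypothetical helper (not part of this skill's scripts): print inclusive
# 0-based page ranges, one 50-page chunk per line.
chunk_ranges() {
  local total="$1" size="${2:-50}" start=0 end
  while [ "$start" -lt "$total" ]; do
    end=$(( start + size - 1 ))
    [ "$end" -ge "$total" ] && end=$(( total - 1 ))
    echo "$start-$end"
    start=$(( end + 1 ))
  done
}
```

For example, `chunk_ranges 137` prints `0-49`, `50-99`, and `100-136`, one range per chunk worker.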
Write to ./prompts/paper-chunk-reader.md:
# Task: Read Paper Chunk and Extract Notes
## Context
You are reading a chunk of [PAPER TITLE] being processed in parallel.
Page images: `./papers/Author_Year_ShortTitle/pngs/page-NNN.png`
Use the strongest available full-size model for this job. Do not use any mini/small tier model.
## Your Chunk
**START_PAGE** to **END_PAGE** (inclusive)
Read each page image in your range. Be exhaustive — extract EVERY equation, parameter, algorithm step, implementation detail, limitation, criticism of prior work, and design rationale. Do not summarize away formal content or skip "minor" material. **Tag every finding with its page number** using *(p.N)* notation — downstream claim extraction depends on this.
## Output Format
Write DIRECTLY to `./papers/Author_Year_ShortTitle/chunks/chunk-STARTPAGE-ENDPAGE.md`:
# Pages START-END Notes
## Chapters/Sections Covered
## Key Findings
## Equations Found (LaTeX)
## Parameters Found (table)
## Rules/Algorithms
## Figures of Interest
## Quotes Worth Preserving
## Implementation Notes
## CRITICAL: Parallel Swarm Awareness
You are running alongside other chunk readers.
- Only write to YOUR chunk file in the chunks/ directory
- NEVER use git restore/checkout/reset/clean
```bash
mkdir -p "./papers/Author_Year_ShortTitle/chunks"
```
If you can dispatch parallel subagents, launch one per chunk simultaneously. Each reads its page range and writes to chunks/chunk-START-END.md. Use the strongest available full-size model for every chunk worker. Never use a mini/small tier worker for chunk extraction.
Do not dispatch chunk workers until you have successfully inspected at least one local page image from this paper yourself. If you cannot inspect even page-000.png, that is a concrete blocker and you should stop there.
If parallel dispatch is not available, process each chunk sequentially yourself.
Read all chunks/chunk-*.md files and synthesize into notes.md. Merge, deduplicate, and organize into the format from Step 3. Preserve detail; synthesis should reorganize and deduplicate, not compress the paper into sparse abstractions. If you can dispatch a synthesis subagent, do so using the strongest available full-size model; otherwise do it yourself.
Continue to Step 3.
Be exhaustive. Extract every equation, every parameter, every algorithm, every stated limitation, every criticism of prior work, and every explicit design choice the authors justify. The goal is that someone implementing this paper never needs to open the PDF. More detail is better than elegant compression.
Write to ./papers/Author_Year_ShortTitle/notes.md:
---
title: "[Full Paper Title]"
authors: "[All authors]"
year: [Year]
venue: "[Journal/Conference/Thesis]"
doi_url: "[If available]"
---
# [Full Paper Title]
## One-Sentence Summary
[What this paper provides for implementation - be specific]
## Problem Addressed
[What gap or issue does this paper solve?]
## Key Contributions
- [Contribution 1]
- [Contribution 2]
## Study Design (empirical papers)
- **Type:** [RCT / cohort / case-control / meta-analysis / systematic review / cross-sectional / etc.]
- **Population:** [N, demographics, inclusion/exclusion criteria] *(p.N)*
- **Intervention(s):** [what was administered, dosage, duration, route] *(p.N)*
- **Comparator(s):** [placebo, active control, standard of care] *(p.N)*
- **Primary endpoint(s):** [what was measured as the main outcome] *(p.N)*
- **Secondary endpoint(s):** [additional outcomes] *(p.N)*
- **Follow-up:** [duration, completeness, dropout rates] *(p.N)*
*Leave this section empty for non-empirical papers (pure theory, algorithms, proofs).*
## Methodology
[High-level description of approach — experimental design, computational method, analytical framework, etc.]
## Key Equations / Statistical Models
$$
[equation in LaTeX]
$$
Where: [variable definitions with units]
*(p.N)*
*Include statistical models (regression specifications, survival models, Bayesian priors) alongside mathematical equations. For clinical papers, capture the primary analysis model even if not presented in formal notation.*
## Parameters
| Name | Symbol | Units | Default | Range | Page | Notes |
|------|--------|-------|---------|-------|------|-------|
*Capture every measurable quantity: physical constants, algorithm thresholds, dosages, sample sizes, hazard ratios, odds ratios, confidence intervals, p-value thresholds, effect sizes — whatever the paper's domain uses.*
## Effect Sizes / Key Quantitative Results
| Outcome | Measure | Value | CI | p | Population/Context | Page |
|---------|---------|-------|----|---|--------------------|------|
*One row per reported effect. Use for any empirical paper — clinical trials, A/B tests, benchmarks, ablation studies. Measure column: HR, OR, RR, RD, ATE, Cohen's d, accuracy, F1, BLEU, etc. Include both primary and subgroup results.*
## Methods & Implementation Details
- Study protocol / experimental setup *(p.N)*
- Statistical methods and software used *(p.N)*
- Data structures / algorithms needed *(p.N)*
- Initialization procedures / calibration *(p.N)*
- Edge cases / sensitivity analyses *(p.N)*
- Pseudo-code if provided *(p.N)*
- Adverse events / safety monitoring (clinical papers) *(p.N)*
## Figures of Interest
- **Fig N (p.X):** [What it shows]
## Results Summary
[Key findings — performance characteristics, clinical outcomes, effect magnitudes, statistical significance] *(p.N)*
## Limitations
[What authors acknowledge doesn't work] *(p.N)*
## Arguments Against Prior Work
- [What specific prior approaches does this paper criticize?] *(p.N)*
- [What failure modes or limitations of prior work does it identify?] *(p.N)*
- [What evidence does it present for the criticism?] *(p.N)*
## Design Rationale
- [What architectural choices does this paper justify?] *(p.N)*
- [What alternatives were considered and why were they rejected?] *(p.N)*
- [What properties does the chosen design preserve that alternatives don't?] *(p.N)*
## Testable Properties
- [Property 1: e.g., "Parameter X must be in [low, high]"] *(p.N)*
- [Property 2: e.g., "Increasing A must increase B"] *(p.N)*
- [Property 3: e.g., "Treatment effect HR < 1.0 for primary endpoint"] *(p.N)*
- [Property 4: e.g., "NNT for outcome Y = Z over N years"] *(p.N)*
- [Property 5: e.g., "Subgroup analysis shows effect modification by age"] *(p.N)*
## Relevance to Project
[How this paper applies to the project's research domain]
## Open Questions
- [ ] [Unclear aspects]
## Related Work Worth Reading
- [Papers cited worth following]
Recognized frontmatter keys: required — title, year; recommended — authors, venue, doi_url; optional — pages, affiliation, affiliations, institution, publisher, supervisor, supervisors, funding, pacs, note, correction_doi, citation, author, doi, url, journal, type, paper.

Write ./papers/Author_Year_ShortTitle/metadata.json.
Use this schema and fill every field you can from the paper/frontmatter:
```json
{
  "title": "Full Paper Title",
  "authors": ["Author One", "Author Two"],
  "year": "2024",
  "arxiv_id": null,
  "doi": "10.xxxx/xxxxx",
  "abstract": "Exact or near-exact abstract text",
  "url": null,
  "pdf_url": null
}
```
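A quick structural check of the written file can catch schema mistakes early. This is a sketch under the assumption that python3 is available; `validate_metadata` is a hypothetical helper, not part of the skill's scripts.

```bash
# Hypothetical helper (not part of this skill's scripts): fail if required
# metadata.json fields are missing or authors is not a JSON array.
validate_metadata() {
  python3 - "$1" <<'EOF'
import json, sys
m = json.load(open(sys.argv[1]))
assert m.get("title") and m.get("year"), "title and year are required"
assert isinstance(m.get("authors"), list), "authors must be a JSON array"
print("metadata.json OK")
EOF
}
```

Usage: `validate_metadata "papers/Author_Year_ShortTitle/metadata.json"`.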
Rules:
- `title`, `authors`, and `year` are required.
- `authors` must be a JSON array, not a single string.
- Use `null` for unknown fields rather than omitting them.
- `doi` should be the DOI string without `https://doi.org/` when possible.
- For arXiv papers, record the `arxiv_id`.

Example parameter table for notes.md:

| Name | Symbol | Units | Default | Range | Notes |
|---|---|---|---|---|---|
| Fundamental frequency | F0 | Hz | 120 | 60-500 | Male speaker baseline |
| Aspirin dose | — | mg/day | 100 | 75-325 | Low-dose range |
| Hazard ratio (MACE) | HR | — | 0.89 | 0.77-1.03 | Primary composite endpoint |
| Learning rate | α | — | 0.001 | 1e-5–0.1 | Adam optimizer |
Rules:
- Use `—` for dimensionless ratios/rates.
- Use `X-Y` for ranges. For effect sizes, the point estimate goes in Default, the CI goes in Range.
- If a parameter varies by context, create one table per context (e.g., "Modal Voice Parameters", "Breathy Voice Parameters", "Age ≥75 Subgroup", "Intention-to-Treat Analysis").
DO NOT use matrix format (parameters as columns, contexts as rows). The extractor expects parameters as rows.
Measurement/data tables use descriptive headers with units in parentheses: F1 (Hz), Duration (ms), HR (95% CI).
Put each equation in its own $$ block; do not combine multiple equations into a single $$ block. Every finding must include its page number. You are reading page images — you know which page you are on. Tag every equation, parameter, key finding, definition, and testable property with *(p.N)* where N is the page number. This is not optional — downstream claim extraction depends on page provenance to produce valid claims. A finding without a page number is a finding that cannot be traced back to the source.
Page-tag placement by section:
- Key Equations: *(p.12)* after the Where: block
- Parameters: Page column in the parameter table
- Results Summary: *(p.N)* inline
- Limitations: *(p.N)* at end of each bullet
- Arguments Against Prior Work / Design Rationale: *(p.N)* at end of each bullet
- Figures of Interest: (p.X) format — keep doing this

Write ./papers/Author_Year_ShortTitle/description.md:
---
tags: [tag1, tag2, tag3]
---
[Sentence 1: What the paper does/presents]
[Sentence 2: Key findings/contributions]
[Sentence 3: Relevance to this project's research domain]
Single paragraph, no blank lines between sentences. Tags: 2-5, lowercase, hyphens for multi-word, prefer existing tags from papers/index.md.
Write ./papers/Author_Year_ShortTitle/abstract.md:
# Abstract
## Original Text (Verbatim)
[Exact abstract text from the paper]
---
## Our Interpretation
[2-3 sentences: What problem? Key finding? Why relevant?]
For chunked papers, if you can dispatch a subagent for this extraction, do so using pngs/page-000.png and the strongest available full-size model. Do not use a fast/mini/small model here. Otherwise, read pngs/page-000.png yourself and write abstract.md.
Write ./papers/Author_Year_ShortTitle/citations.md:
# Citations
## Reference List
[Every citation from References/Bibliography, preserving original formatting]
## Key Citations for Follow-up
[3-5 most relevant citations with brief notes on why]
For chunked papers, if you can dispatch a subagent for this extraction, do so using the last 5-10 page images and the strongest available full-size model. Do not use a fast/mini/small model here. Otherwise, read those pages yourself and write citations.md.
Steps 5 and 6 can run in parallel since they write to different files.
Invoke the reconcile skill on papers/Author_Year_ShortTitle if skill invocation is available. Otherwise, follow the reconcile skill instructions directly on that directory. This handles forward/reverse cross-referencing, reconciliation of citing papers, and backward annotations.
If nested skill invocation is unavailable or unreliable on this platform, derive this skill's installed directory from the injected <path>, then run:

```bash
uv run "<skill-dir>/../reconcile/scripts/emit_nested_reconcile_fallback.py"
```

Read the FULL stdout and follow it exactly on the current paper directory instead of opening reconcile/SKILL.md piecemeal.
Wait for reconcile to complete before proceeding.
Append to papers/index.md:

```markdown
## Author_Year_ShortTitle (tag1, tag2, tag3)
[description.md body text — no frontmatter, no tags line]
```
This step is NOT optional. Without it, future sessions won't know this paper exists.
```bash
uv run plugins/research-papers/scripts/stamp_provenance.py \
  "papers/<Author_Year_ShortTitle>/notes.md" \
  --agent "<your model name>" --skill paper-reader
```
This records which model read the paper, when, and which plugin version was used. Plugin version is autodetected.
All papers produce: papers/Author_Year_Title/ containing notes.md, metadata.json, description.md, abstract.md, citations.md, pngs/, and an updated papers/index.md entry.
Papers >50 pages also produce chunks/.
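The expected outputs above can be verified with a short loop before reporting done. This is a sketch, not part of the skill's scripts; `check_outputs` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of this skill's scripts): report any missing
# required output file in a processed paper directory; nonzero exit if any.
check_outputs() {
  local d="$1" f missing=0
  for f in notes.md metadata.json description.md abstract.md citations.md; do
    [ -f "$d/$f" ] || { echo "MISSING: $d/$f"; missing=1; }
  done
  [ -d "$d/pngs" ] || { echo "MISSING: $d/pngs/"; missing=1; }
  return "$missing"
}
```

Usage: `check_outputs "papers/Author_Year_ShortTitle"`.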
When done:
Done - created papers/[dirname]/
- index.md updated
- Reconciliation: [summary]
Then provide a brief usefulness assessment in the conversation (not a file):
## Usefulness to This Project
**Rating:** [High/Medium/Low/Marginal]
**What it provides:** [concrete takeaways]
**Actionable next steps:** [what to implement or investigate]
**Skip if:** [when this paper isn't relevant]
Do NOT:
- Leave temp files or stray PDFs in papers/ root
- Use cp instead of mv for the source PDF