From thinking-frameworks-skills
Fetches arXiv papers by date window and keywords, optionally restricted to categories (cs.LG, stat.ML, etc). Normalizes records for deduplication with other preprint sources.
npx claudepluginhub lyndonkl/claude --plugin thinking-frameworks-skills
How this skill is triggered: by the user, by Claude, or both
Slash command
/thinking-frameworks-skills:fetch-arxiv-recent
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Fetch arXiv papers submitted within a date window, optionally restricted to specific arXiv categories, and keyword-filter the results. Returns records in the same canonical shape as the other three literature fetchers.
- [ ] Step 1: Validate inputs (from, to, keywords, optional categories)
- [ ] Step 2: Build the search_query string with date range + categories + keywords
- [ ] Step 3: Page through the API until results exhausted or window endpoint passed
- [ ] Step 4: Parse Atom XML; normalize each entry
- [ ] Step 5: Dedupe by arXiv ID (keep latest version)
- [ ] Step 6: Return matched records + summary
Step 1 — Validate inputs
Required:
- from: YYYY-MM-DD, inclusive
- to: YYYY-MM-DD, inclusive, to >= from
- keywords: list of strings (case-insensitive substring match against title + abstract)
Optional:
- categories: list of arXiv category codes, e.g. ["cs.LG", "cs.CL", "stat.ML"]. If omitted, search all of arXiv. Common groupings the caller may want to expose:
  - cs.LG, cs.CL, cs.CV, cs.AI, cs.NE, stat.ML
  - math.ST, math.PR, math.OC, stat.ME, stat.AP
  - q-bio.QM, q-bio.GN, q-bio.MN, q-bio.NC
  - physics.data-an, cond-mat.stat-mech
The skill itself does not hard-code these groupings; it accepts whatever categories the caller passes. The orchestrator's source-registry.md can document the groupings the operator cares about.
Reject if window > 31 days (same rule as fetch-preprint-recent — wider windows almost always indicate a misuse).
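The checks in this step could be sketched as follows. The function name `validate_inputs` and the error messages are illustrative, not part of the skill. The sketch also assumes the 31-day cap counts both inclusive endpoints, which the text above does not spell out.

```python
from datetime import datetime

MAX_WINDOW_DAYS = 31  # cap from Step 1


def validate_inputs(from_, to, keywords, categories=None):
    """Return (start, end) as dates, or raise ValueError on bad input."""
    try:
        start = datetime.strptime(from_, "%Y-%m-%d").date()
        end = datetime.strptime(to, "%Y-%m-%d").date()
    except (TypeError, ValueError) as exc:
        raise ValueError(f"from/to must be YYYY-MM-DD: {exc}") from exc
    if end < start:
        raise ValueError("to must be >= from")
    # Assumption: both endpoints are inclusive, so the window length is
    # the day difference plus one.
    if (end - start).days + 1 > MAX_WINDOW_DAYS:
        raise ValueError(f"window exceeds {MAX_WINDOW_DAYS} days")
    if not keywords or not all(isinstance(k, str) and k.strip() for k in keywords):
        raise ValueError("keywords must be a non-empty list of non-empty strings")
    if categories is not None and not all(
        isinstance(c, str) and c.strip() for c in categories
    ):
        raise ValueError("categories must be a list of category codes")
    return start, end
```

Category codes are only checked for being non-empty strings, since the skill accepts whatever categories the caller passes.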
Step 2 — Build the query string
The arXiv search API accepts a query language with fielded search. Compose:
search_query = (date_clause) AND (category_clause)? AND (keyword_clause)
Date clause (always include — without it you'll get the whole arXiv history):
submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]
Use from + 0000 and to + 2359 so the window is fully inclusive of both days.
Category clause (optional): join categories with OR.
(cat:cs.LG OR cat:cs.CL OR cat:stat.ML)
Keyword clause: join with OR. Use all: for full-record search (covers title + abstract + authors, which is what we want for a wide-net keyword scan):
(all:"protein language model" OR all:"diffusion model" OR all:gnn)
Multi-word phrases must be double-quoted. Single words don't need quotes.
Final assembled example:
submittedDate:[202605040000 TO 202605102359]
AND (cat:cs.LG OR cat:cs.CL OR cat:stat.ML)
AND (all:"protein language model" OR all:"diffusion model" OR all:transformer)
URL-encode and pass as search_query to:
http://export.arxiv.org/api/query?search_query={URL_ENCODED}&sortBy=submittedDate&sortOrder=descending&start={offset}&max_results=200
sortBy=submittedDate&sortOrder=descending is important — it lets you stop paginating as soon as the result dates fall before from, instead of having to walk the whole result set.
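A minimal sketch of the query assembly described in this step, using only the standard library. The helper names `build_search_query` and `build_url` are hypothetical. The sketch assumes keywords contain no embedded double quotes.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def build_search_query(from_, to, keywords, categories=None):
    """Compose date, optional category, and keyword clauses with AND."""
    # YYYY-MM-DD -> YYYYMMDD, then pad to cover both days in full.
    start = from_.replace("-", "") + "0000"
    end = to.replace("-", "") + "2359"
    clauses = [f"submittedDate:[{start} TO {end}]"]
    if categories:
        clauses.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    # Multi-word phrases are double-quoted; single words are left bare.
    terms = [f'all:"{k}"' if " " in k else f"all:{k}" for k in keywords]
    clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


def build_url(search_query, start=0, max_results=200):
    """URL-encode the query and attach sort and paging parameters."""
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


query = build_search_query(
    "2026-05-04",
    "2026-05-10",
    ["protein language model", "diffusion model", "transformer"],
    ["cs.LG", "cs.CL", "stat.ML"],
)
print(build_url(query, start=200))
```

`urlencode` encodes spaces as `+` and percent-encodes the brackets, colons, and quotes, so the query string can be passed through unchanged.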
Step 3 — Pagination
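arXiv returns at most 200 records per page. After each page:
- If the page contains no entries, the results are exhausted; stop.
- If the oldest published date on the page is older than from, stop (the rest are out of window).
- Otherwise, increase start by 200 and request again.
Rate limit: arXiv asks for at most 1 request every 3 seconds. Sleep between paginated requests. Don't hammer.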
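The pagination loop might look like the sketch below. The function name `paginate` is hypothetical. `fetch_page` is an assumed caller-supplied function that requests one offset and returns already-parsed entries (newest first), each with a `date` string in YYYY-MM-DD form. Injecting `fetch_page` and `sleep` keeps the stop conditions testable without touching the network.

```python
import time

PAGE_SIZE = 200           # max_results per request
REQUEST_INTERVAL_S = 3.0  # arXiv asks for one request every 3 seconds


def paginate(fetch_page, window_start, sleep=time.sleep):
    """Collect in-window entries; return (entries, pages_fetched)."""
    collected = []
    pages = 0
    offset = 0
    while True:
        if pages:
            sleep(REQUEST_INTERVAL_S)  # pause before every request after the first
        entries = fetch_page(offset)
        pages += 1
        if not entries:
            break  # results exhausted
        # ISO dates compare correctly as plain strings.
        collected.extend(e for e in entries if e["date"] >= window_start)
        if entries[-1]["date"] < window_start:
            break  # sorted descending, so everything after this is older
        offset += PAGE_SIZE
    return collected, pages
```

The early exit only works because the request sorts by submittedDate in descending order; with any other sort the loop would have to walk the whole result set.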
Step 4 — Parse Atom XML
The API returns an Atom feed. Each <entry> has:
- <id>: full URL like http://arxiv.org/abs/2605.12345v1. The canonical paper ID is the trailing 2605.12345 (post-2007 format) or cs.LG/0301001 (legacy format)
- <title>: paper title (may have newlines + indentation; collapse whitespace)
- <summary>: abstract (same whitespace caveat)
- <author><name>: one per author; preserve order
- <published>: ISO timestamp of v1 submission (use this as date)
- <updated>: ISO timestamp of latest version (different if revised)
- <arxiv:primary_category term="cs.LG">: primary category
- <category term="..."/>: list of all categories
- <link rel="alternate" type="text/html" href="..."/>: abstract page URL
- <link title="pdf" rel="related" type="application/pdf" href="..."/>: PDF URL
- <arxiv:doi>: optional, present once paper is published in a journal
If the Atom parser fails, retry the request once with a 5-second backoff. On second failure, log the error to fetch_errors and skip the page.
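One way to read these fields with the standard library's ElementTree is sketched below. The function names and the record keys are illustrative. The sketch assumes the feed uses the standard Atom namespace plus the `http://arxiv.org/schemas/atom` extension namespace, and it treats a missing version suffix as version 1.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}
ID_TAIL = re.compile(r"(?P<id>.+?)(?:v(?P<version>\d+))?")


def clean(text):
    """Collapse newlines and indentation runs to single spaces."""
    return " ".join((text or "").split())


def parse_feed(xml_text):
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    records = []
    for entry in root.findall("atom:entry", NS):
        id_url = clean(entry.findtext("atom:id", default="", namespaces=NS))
        m = ID_TAIL.fullmatch(id_url.split("/abs/", 1)[-1])
        if m is None:
            continue  # entry without a usable <id>
        primary = entry.find("arxiv:primary_category", NS)
        links = entry.findall("atom:link", NS)
        records.append({
            "id": m.group("id"),
            # Assumption: no vN suffix means version 1.
            "version": int(m.group("version") or 1),
            "title": clean(entry.findtext("atom:title", namespaces=NS)),
            "abstract": clean(entry.findtext("atom:summary", namespaces=NS)),
            "authors": [
                clean(a.findtext("atom:name", namespaces=NS))
                for a in entry.findall("atom:author", NS)
            ],
            "date": clean(entry.findtext("atom:published", namespaces=NS))[:10],
            "primary_category": primary.get("term") if primary is not None else None,
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "url": next((l.get("href") for l in links if l.get("rel") == "alternate"), None),
            "pdf_url": next((l.get("href") for l in links if l.get("title") == "pdf"), None),
            "doi": entry.findtext("arxiv:doi", namespaces=NS),
        })
    return records
```

A caller implementing the retry rule would catch `ET.ParseError` around `parse_feed`, refetch once after the backoff, and record the failure if the second attempt also raises.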
Step 5 — Dedupe by arXiv ID
The paginated results may include the same paper twice if a v2 was submitted within the window. Keep the highest version per ID.
The arXiv ID alone (e.g. 2605.12345) is the canonical key — strip the v1/v2 suffix from the URL and use that.
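A short sketch of the dedupe step, assuming each record is a dict that already carries the bare `id` (version suffix stripped) and an integer `version`. The function name `dedupe_latest` is hypothetical.

```python
def dedupe_latest(records):
    """Keep one record per arXiv ID, preferring the highest version."""
    latest = {}
    for rec in records:
        kept = latest.get(rec["id"])
        if kept is None or rec["version"] > kept["version"]:
            latest[rec["id"]] = rec
    # dicts preserve insertion order, so each ID stays where it first appeared.
    return list(latest.values())


records = [
    {"id": "2605.12345", "version": 1},
    {"id": "2605.99999", "version": 1},
    {"id": "2605.12345", "version": 2},
]
print(dedupe_latest(records))
```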
Step 6 — Normalize and return
Same canonical record shape as the other fetchers:
{
"id": "arxiv:2605.12345", // arxiv-prefixed for source clarity
"title": "...",
"authors": ["Smith J", "Doe A", ...],
"abstract": "...",
"date": "2026-05-07", // YYYY-MM-DD parsed from <published>
"server": "arxiv",
"primary_category": "cs.LG",
"categories": ["cs.LG", "stat.ML"],
"version": 2,
"doi": "10.1145/...", // if present, otherwise null
"url": "https://arxiv.org/abs/2605.12345", // abstract page (preferred for digest links)
"pdf_url": "https://arxiv.org/pdf/2605.12345.pdf",
"matched_keywords": ["protein language model"]
}
Apply the keyword filter client-side. The API's all: clause ORs the keywords across the full record, authors included, and does not report which keyword matched. For each record, check title + abstract against the keyword list, populate matched_keywords, and drop records that match none (the API's recall is broader than the operator's intent, so the skill must filter).
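The client-side filter could be sketched like this. The function names are illustrative, and records are assumed to be dicts with `title` and `abstract` already whitespace-normalized.

```python
def match_keywords(record, keywords):
    """Return the keywords found in title or abstract, case-insensitively."""
    # A newline separator stops a phrase from matching across the
    # title/abstract boundary.
    haystack = f'{record["title"]}\n{record["abstract"]}'.casefold()
    return [k for k in keywords if k.casefold() in haystack]


def filter_records(records, keywords):
    """Attach matched_keywords and drop records that match nothing."""
    kept = []
    for rec in records:
        hits = match_keywords(rec, keywords)
        if hits:
            kept.append({**rec, "matched_keywords": hits})
    return kept
```

Because this is a plain substring match, a short keyword such as gnn also matches inside longer tokens; a word-boundary match would be a stricter alternative if that proves too noisy.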
Return summary:
{
"server": "arxiv",
"window": "2026-05-04/2026-05-10",
"categories": ["cs.LG", "cs.CL", "stat.ML"],
"query": "(...full search_query...)",
"fetched_total": 1240,
"matched_total": 18,
"pages_fetched": 7,
"fetch_errors": [],
"records": [ ... ]
}
Cache the raw Atom XML responses (one per page) to .cache/{YYYY-WW}-arxiv-{page}.xml.
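The cache path can be derived as in the sketch below. The text does not say which date determines YYYY-WW or whether pages are numbered from 0 or 1, so this assumes the ISO week of the window's end date and whatever page number the caller passes. The function name `cache_path` is hypothetical.

```python
from datetime import date
from pathlib import Path


def cache_path(window_end, page, cache_dir=".cache"):
    """Build .cache/{YYYY-WW}-arxiv-{page}.xml for one raw response."""
    # Assumption: YYYY-WW is the ISO year and ISO week of the window end.
    iso = window_end.isocalendar()
    return Path(cache_dir) / f"{iso[0]}-{iso[1]:02d}-arxiv-{page}.xml"


print(cache_path(date(2026, 5, 10), 1))
```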
Pattern A — CS-only weekly scan: categories=["cs.LG", "cs.CL", "cs.CV", "cs.AI", "stat.ML"]. The default for an ML/CS-leaning watchlist.
Pattern B — Cross-disciplinary scan (CS + quant-bio): categories=["cs.LG", "stat.ML", "q-bio.QM", "q-bio.GN"]. When the operator wants computational-biology preprints from arXiv that bioRxiv may miss.
Pattern C — All of arXiv: categories=None. Maximum recall, maximum noise. Only useful when the keyword filter is very tight.
Pattern D — Track a specific arXiv account: outside this skill's default scope. Add (au:"Smith, J" OR au:"Doe, A") to the search_query as an additional AND clause.
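For Pattern D, the extra author clause can be appended to an already-built query, as in this small sketch. The function name `add_author_clause` is hypothetical, and the author strings are passed through exactly as the caller supplies them.

```python
def add_author_clause(search_query, authors):
    """Append an AND (au:... OR au:...) clause; no-op when authors is empty."""
    if not authors:
        return search_query
    clause = " OR ".join(f'au:"{name}"' for name in authors)
    return f"{search_query} AND ({clause})"


print(add_author_clause("(all:gnn)", ["Smith, J", "Doe, A"]))
```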
- Always include the date clause. Without submittedDate:[... TO ...] arXiv returns the entire history of the matching query, which can be millions of records. This is the single most common arXiv-API mistake.
- Collapse <title> and <summary> whitespace. Atom feeds often contain literal \n plus 6 spaces of indentation. Collapse all internal whitespace runs to single spaces before storing.
- Don't confuse the id URL with the canonical ID. In <id>http://arxiv.org/abs/2605.12345v2</id> the canonical ID is 2605.12345. The v2 is the version suffix; strip it for dedupe.
- Accept legacy IDs such as cs.LG/0301001. The skill should accept both formats, never assert the modern one. Match ^[a-z-]+(\.[A-Z]+)?/\d+$|^\d{4}\.\d{4,5}$ to validate (the subject-class part is optional, since legacy IDs such as hep-th/9901001 carry none).
- Don't expand keywords beyond the watchlist. all: search is already broad; further synonym expansion silently widens the net. The watchlist is the gate.
| Field | Value | Notes |
|---|---|---|
| Endpoint | http://export.arxiv.org/api/query?search_query={q}&... | HTTPS works too but the docs use HTTP |
| Auth | None | Public API. Use a User-Agent header if WebFetch lets you set one. |
| Page size | 200 max (max_results=200) | |
| Rate limit | 1 request / 3 seconds | Sleep between paginated requests |
| Window cap | 31 days (soft); 7 days for weekly digests | |
| Date format | YYYYMMDDHHMM (no dashes, no colons) | Use 0000 for from, 2359 for to |
| ID format | 2605.12345 (post-2007), cs.LG/0301001 (legacy) | Strip vN suffix from the URL; canonical ID excludes version |
| Canonical URL | https://arxiv.org/abs/{id} | Abstract page; preferred for digest links |
| PDF URL | https://arxiv.org/pdf/{id}.pdf | Optional secondary link |
| Sort | sortBy=submittedDate&sortOrder=descending | Required for early-bail pagination |
| Common cats | cs.LG, cs.CL, cs.CV, cs.AI, stat.ML, math.ST, q-bio.QM | Operator decides; skill takes any list |
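The ID-format and URL rows above can be combined into one small helper, sketched here. The function name `canonical_links` is hypothetical. The pattern makes the legacy subject class optional, on the assumption that legacy IDs such as hep-th/9901001 should also pass.

```python
import re

# Legacy form (archive, optional .SUBJECT, slash, digits) or modern form.
ARXIV_ID_RE = re.compile(r"^[a-z-]+(\.[A-Z]+)?/\d+$|^\d{4}\.\d{4,5}$")


def canonical_links(arxiv_id):
    """Validate a version-free arXiv ID and build its record links."""
    if not ARXIV_ID_RE.fullmatch(arxiv_id):
        raise ValueError(f"not a canonical arXiv ID: {arxiv_id!r}")
    return {
        "id": f"arxiv:{arxiv_id}",
        "url": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf_url": f"https://arxiv.org/pdf/{arxiv_id}.pdf",
    }


print(canonical_links("2605.12345"))
```

An ID that still carries its vN suffix is rejected, which surfaces a missed strip before the record reaches dedupe.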