From the spotlight plugin
Archives web pages via Wayback Machine, Archive.today, or local scrape for fact-checking and investigations, with chain-of-custody Markdown files.
npx claudepluginhub buriedsignals/skills --plugin spotlight

This skill uses the workspace's default tool permissions.
> **Adapted from** [jamditis/claude-skills-journalism](https://github.com/jamditis/claude-skills-journalism) by Jay Amditis (MIT License).
Archive evidence sources as you find them, before they disappear. This skill is for investigators and fact-checkers who need to preserve sources for editorial accountability, legal defensibility, and reproducibility.
Try the services below in order. Stop when you have a confirmed archived copy.
Check if already archived:
curl -s "https://archive.org/wayback/available?url={URL}" | jq '.archived_snapshots.closest'
Submit for archiving:
curl -s -I "https://web.archive.org/save/{URL}"
# Returns Location header with the new snapshot URL
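To capture that snapshot URL in a script, one option is to pull the Location header out of the response. A minimal sketch, assuming the save request succeeded and returned that header:

```bash
# Extract the snapshot URL from the Location header; strip the trailing CR
curl -s -I "https://web.archive.org/save/{URL}" \
  | grep -i '^location:' \
  | tr -d '\r' \
  | awk '{print $2}'
```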
Find all snapshots (CDX API):
curl "http://web.archive.org/cdx/search/cdx?url={URL}&output=json&limit=5&fl=timestamp,statuscode,original"
Submit via form:
curl -s -L -X POST "https://archive.ph/submit/" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "url={URL}" \
-d "anyway=1"
# Follow redirects — final URL is the archived copy
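If you want the final URL programmatically rather than by reading the redirect chain, curl's `%{url_effective}` write-out variable can report it. A sketch, assuming Archive.today responds with a normal redirect (it sometimes serves an interstitial page instead):

```bash
# Print the URL curl ended up at after following redirects; discard the body
curl -s -L -o /dev/null -w '%{url_effective}\n' \
  -X POST "https://archive.ph/submit/" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "url={URL}" \
  -d "anyway=1"
```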
Check for existing copy:
curl -s "https://archive.ph/newest/{URL}"
When archive services are rate-limited or unavailable, scrape directly using your configured search library and save to the evidence store. Treat local scrapes as lower-confidence preservation — they are not independently verifiable.
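A minimal sketch of that fallback, assuming plain `curl` plus `pandoc` for the HTML-to-Markdown step (the skill's configured search library may do this differently), saving into the evidence store described below:

```bash
# Last-resort local scrape: fetch the live page and convert it to Markdown.
# Not independently verifiable; record it as lower-confidence in the finding.
URL="https://example.com/article/path"    # illustrative
SLUG="example-article"                    # illustrative
OUT="cases/{project}/research/archived/${SLUG}-archived-$(date -u +%Y%m%d).md"
curl -sL "$URL" | pandoc -f html -t gfm -o "$OUT"
```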
All archived copies go to cases/{project}/research/archived/.
Naming convention: {domain}-{slug}-archived-{YYYYMMDD}.md
Examples:
cases/project-name/research/archived/reuters-ukraine-ceasefire-archived-20260315.md
cases/project-name/research/archived/companyreg-gov-uk-filing-archived-20260315.md
Embed this header in every archived file. It is the provenance record.
---
archived_at: 2026-03-15T14:22:00Z
original_url: https://example.com/article/path
archive_url: https://web.archive.org/web/20260315142200/https://example.com/article/path
archive_service: Wayback Machine
archived_by: investigator | fact-checker
case: {project}
---
Without this block, the file is not a valid archived source — it is just a local copy.
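A sketch of how the header and the retrieved content come together in one file (all field values here are illustrative):

```bash
# Write the provenance header, then append the retrieved page content
FILE="cases/{project}/research/archived/example-article-archived-20260315.md"
cat > "$FILE" <<'EOF'
---
archived_at: 2026-03-15T14:22:00Z
original_url: https://example.com/article/path
archive_url: https://web.archive.org/web/20260315142200/https://example.com/article/path
archive_service: Wayback Machine
archived_by: investigator
case: {project}
---
EOF
# Append the archived page body below the header (convert to Markdown first if preferred)
curl -s "https://web.archive.org/web/20260315142200/https://example.com/article/path" >> "$FILE"
```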
If the original URL returns 404 or is otherwise gone, check the Wayback Machine before marking the source as unavailable:
curl "http://web.archive.org/cdx/search/cdx?url={URL}&output=json&limit=1&fl=timestamp,statuscode"
If a snapshot exists, retrieve it:
curl "https://web.archive.org/web/{TIMESTAMP}/{URL}" -o .research/archived/{filename}.md
Only mark a source as unavailable after checking all three services.
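A rough sketch of that three-way check (the SLUG and the status-code heuristics are illustrative, not the skill's exact logic):

```bash
URL="https://example.com/article/path"    # illustrative
SLUG="example-article"                    # illustrative

wayback=$(curl -s "http://web.archive.org/cdx/search/cdx?url=${URL}&limit=1")            # empty output = no snapshot
archiveph=$(curl -s -o /dev/null -w '%{http_code}' "https://archive.ph/newest/${URL}")   # 4xx/5xx = no copy
local_copies=$(ls cases/*/research/archived/${SLUG}-archived-*.md 2>/dev/null | wc -l)

if [ -z "$wayback" ] && [ "$archiveph" -ge 400 ] && [ "$local_copies" -eq 0 ]; then
  echo "No copy in Wayback, Archive.today, or the local store; mark the source as unavailable."
fi
```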
Add archive_url to every source entry in findings.json and fact-check.json:
{
"url": "https://example.com/article",
"type": "news",
"accessed": "2026-03-15T14:20:00Z",
"archive_url": "https://web.archive.org/web/20260315142200/https://example.com/article",
"local_file": "cases/{project}/research/archived/example-article-archived-20260315.md"
}
If the page could not be archived, set archive_url to null and note why in the finding's confidence_rationale.
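For example, a source entry for a page that resisted archiving might look like this; the explanation itself (for instance, "page blocks archive crawlers") goes in the parent finding's confidence_rationale:

```json
{
  "url": "https://example.com/article",
  "type": "news",
  "accessed": "2026-03-15T14:20:00Z",
  "archive_url": null,
  "local_file": null
}
```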
Adapted from claude-skills-journalism by Jay Amditis, released under MIT License.