Mercury Strategy collection stage 2 — four-source site discovery, section classification, content scraping, negative verification, and site structure output. Use when the consultant runs /ms-crawl. This stage collects only. It produces no findings, no evaluations, and no recommendations.
```
npx claudepluginhub mb-uc/mercury --plugin mercury
```
This skill uses the workspace's default tool permissions.
Collection stage 2 of 3. Executes the four-source crawl methodology defined in references/CRAWL_CONFIG.md, classifies all discovered URLs using references/CLASSIFICATION_RULES.md, scrapes a representative page sample, runs negative verification for all concepts in references/NEGATIVE_VERIFICATION_CONCEPTS.md, and outputs a structured evidence manifest and an HTML directory tree site structure file.
Before starting:
- Confirm `{company}-ms-brief-evidence.json` exists. If it does not, surface a clear message and prompt the consultant to run /ms-brief first. Do not silently trigger it.
- Reuse the firecrawl_map status from Step 4 of ms-brief, and any CRAWL_CONFIG entries for this domain.
- Read references/CRAWL_CONFIG.md in full before making any scrape calls.

A partial crawl produces unreliable absence claims. Every URL classification must be supported by at least one of the four discovery sources below. No absence claim may be asserted unless it has passed the three-step negative verification procedure in references/NEGATIVE_VERIFICATION_CONCEPTS.md.
Source 1: sitemap. Goal: Establish the declared URL inventory.
- Fetch `{domain}/sitemap.xml`.
- Fetch `{domain}/sitemap_index.xml` if the first fetch fails or returns 404.
- Extract all `<loc>` URLs, following child sitemaps (e.g. news-sitemap.xml).

Record in crawl_summary.sources.sitemap:
- present — sitemap exists and is comprehensive (covers 70%+ of URLs found by firecrawl_map)
- present_incomplete — sitemap exists but sparse relative to firecrawl_map results
- not_found — no sitemap returned
- blocked — fetch timed out or returned non-200

If sitemap fails, proceed to Source 2. Do not block.
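Under these rules, the sitemap status can be sketched as a small classifier. The 70% threshold and status names come from the list above; treating 404 as not_found and every other non-200 outcome as blocked is an assumption of this sketch.

```python
def classify_sitemap_status(http_status, sitemap_urls, map_urls):
    """Classify Source 1 per the status definitions above.

    http_status: HTTP status of the sitemap fetch, or None on timeout.
    sitemap_urls: count of <loc> entries extracted from the sitemap(s).
    map_urls: URL count from firecrawl_map, or 0 if that source is unavailable.
    """
    if http_status is None or http_status not in (200, 404):
        return "blocked"            # timed out or returned non-200
    if http_status == 404:
        return "not_found"          # no sitemap returned
    # Comprehensive means covering 70%+ of the URLs firecrawl_map found.
    # With no map data to compare against, this sketch defaults to "present".
    if map_urls == 0 or sitemap_urls >= 0.7 * map_urls:
        return "present"
    return "present_incomplete"
```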
Source 2: homepage navigation. Goal: Establish the intended site structure as presented to visitors.
- Scrape the homepage with firecrawl_scrape (check CRAWL_CONFIG first).
- Extract links from `<nav>`, `<header>`, and elements with nav-related class or role attributes.
- Record each link as `{ url, label, nav_position }`, where nav_position is primary_nav, secondary_nav, footer, or homepage_body.

Note: if the homepage scrape fails, proceed with Sources 1, 3, and 4 and note the gap.
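A minimal stdlib sketch of the link extraction, assuming the scraped homepage HTML is available as a string. The mapping of container tags to nav_position values is a simplification; real pages also need the class/role-attribute checks described above.

```python
from html.parser import HTMLParser

class NavLinkExtractor(HTMLParser):
    """Collect {url, label, nav_position} records from nav-related markup."""

    # Simplified mapping: <nav> -> primary_nav, <header> -> secondary_nav,
    # <footer> -> footer; links outside these count as homepage_body.
    POSITIONS = {"nav": "primary_nav", "header": "secondary_nav", "footer": "footer"}

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open nav-related containers
        self.links = []
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        if tag in self.POSITIONS:
            self.stack.append(tag)
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data.strip())

    def handle_endtag(self, tag):
        if tag in self.POSITIONS and tag in self.stack:
            self.stack.remove(tag)
        elif tag == "a" and self._href:
            position = self.POSITIONS[self.stack[-1]] if self.stack else "homepage_body"
            self.links.append({
                "url": self._href,
                "label": " ".join(self._label).strip(),
                "nav_position": position,
            })
            self._href = None
```

Feed the scraped HTML to an instance via `feed()` and read the records off `.links`.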
Source 3: firecrawl_map. Goal: Discover all linked pages — the full reachable URL graph.
If ms-brief Step 4 already ran firecrawl_map and returned a complete result, use that output — do not re-run unnecessarily. If it returned blocked or partial, attempt again.
- Run firecrawl_map on the root domain.
- Fetch `{domain}/robots.txt` and note any significant Disallow rules.

Record in crawl_summary.sources.firecrawl_map:
- complete — returned 10+ URLs with no error indicators
- partial — returned some URLs but appears incomplete
- blocked — returned fewer than 10 URLs or timed out

If blocked, record and rely on Sources 1, 2, and 4.
Source 4: pagination. Goal: Establish the true depth of archive sections.
From the URL inventory gathered in Sources 1–3, identify paginated sections:
For each paginated section, follow pagination links up to a maximum of 5 pages. Record:
If pagination links are not found or are JavaScript-rendered, record not_assessed — do not assume the section is shallow.
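The 5-page cap above can be sketched as a bounded walk. The `fetch_links` helper is hypothetical and stands in for real next-page extraction from scraped HTML; it returns the next-page URL or None.

```python
def probe_pagination(first_page_url, fetch_links, max_pages=5):
    """Follow pagination links up to max_pages and report observed depth."""
    seen = [first_page_url]
    url = first_page_url
    while len(seen) < max_pages:
        nxt = fetch_links(url)
        if nxt is None or nxt in seen:   # no next link, or a loop
            break
        seen.append(nxt)
        url = nxt
    return {
        "pages_followed": len(seen),
        "capped": len(seen) == max_pages,  # true depth may exceed the cap
        "last_url": url,
    }
```

When `capped` is true, the section is at least max_pages deep; record that rather than an exact count.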
After all four sources, deduplicate the URL inventory (strip tracking parameters, remove pagination variants per references/CRAWL_CONFIG.md URL exclusion rules) and compile:
```json
{
  "domain": "",
  "crawl_date": "",
  "sources": {
    "sitemap": "present | present_incomplete | not_found | blocked",
    "navigation": "extracted | failed",
    "firecrawl_map": "complete | partial | blocked",
    "pagination": "assessed | not_assessed"
  },
  "pages_discovered": 0,
  "pages_crawled": 0,
  "excluded": 0,
  "error_pages": 0,
  "subdomains_found": [],
  "robots_disallow": [],
  "coverage_confidence": "high | medium | low"
}
```
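The deduplication step can be sketched with the standard library. The tracking-parameter list here is illustrative only; the authoritative exclusion rules live in references/CRAWL_CONFIG.md.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative lists; CRAWL_CONFIG.md defines the real exclusion rules.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid"}
PAGINATION_PARAMS = {"page", "p", "offset"}

def canonicalise(url):
    """Strip tracking and pagination parameters so URL variants collapse."""
    scheme, netloc, path, query, _ = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query)
            if k.lower() not in TRACKING_PARAMS | PAGINATION_PARAMS]
    return urlunsplit((scheme, netloc.lower(), path.rstrip("/") or "/",
                       urlencode(kept), ""))

def dedupe(urls):
    """Deduplicate the inventory, preserving first-seen order."""
    seen, out = set(), []
    for u in urls:
        c = canonicalise(u)
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out
```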
Coverage confidence:
- high — all four sources succeeded, sitemap is comprehensive
- medium — 2–3 sources succeeded, or sitemap is incomplete
- low — only 1 source succeeded, or firecrawl_map was blocked

Low coverage confidence downgrades all absence claims to not_assessed in the evidence manifest.
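These thresholds can be expressed as a small helper over the crawl_summary.sources statuses. Reading "sitemap is comprehensive" as the present status from Source 1 is an interpretation of the rules above.

```python
def coverage_confidence(sources):
    """Derive coverage confidence from the four source statuses.

    sources mirrors crawl_summary.sources, e.g.
    {"sitemap": "present", "navigation": "extracted",
     "firecrawl_map": "complete", "pagination": "assessed"}
    """
    succeeded = sum([
        sources.get("sitemap") == "present",          # comprehensive sitemap
        sources.get("navigation") == "extracted",
        sources.get("firecrawl_map") == "complete",
        sources.get("pagination") == "assessed",
    ])
    if sources.get("firecrawl_map") == "blocked" or succeeded <= 1:
        return "low"
    if succeeded == 4:
        return "high"
    return "medium"    # 2-3 succeeded, or sitemap incomplete
```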
Classify every discovered URL using references/CLASSIFICATION_RULES.md in strict priority order:
For ambiguous URLs (see references/CLASSIFICATION_RULES.md Ambiguous URL handling), record as ambiguous and resolve during content scraping using page title and content signals.
Apply document sub-classification to all URLs classified as document type.
Apply careers platform detection (e.g. Workday, Taleo, Greenhouse) if a careers link routes to an external domain.
After classification, scrape a representative sample. Use firecrawl_scrape with onlyMainContent: true unless CRAWL_CONFIG specifies an override.
If a scrape returns more than 60,000 characters, flag as nav bloat. Check CRAWL_CONFIG for a selector override. Do not pass bloated content to the evidence manifest.
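The size gate might look like the following sketch; the 60,000-character threshold comes from the rule above.

```python
MAX_CONTENT_CHARS = 60_000

def check_scrape(content):
    """Gate scraped content before it reaches the evidence manifest."""
    if len(content) > MAX_CONTENT_CHARS:
        # Nav bloat: re-scrape with a CRAWL_CONFIG selector override
        # instead of passing the bloated content downstream.
        return {"ok": False, "flag": "nav_bloat", "chars": len(content)}
    return {"ok": True, "chars": len(content)}
```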
Tier 1: Homepage, IR landing, strategy page, sustainability landing, about/at-a-glance, careers landing, newsroom landing, governance overview, leadership/board page.
Tier 2: Investment case, results page, annual report page, sustainability strategy page, one sustainability topic page, most recent news article.
Tier 3: Committee pages, sustainability reporting centre, ESG data page, graduate programme page, employee stories page, CMD/investor day page, 3–5 additional news articles.
Default depth is Tier 1 + Tier 2 unless the consultant has requested a deep audit.
For each scraped page, apply presence quality classification from references/CLASSIFICATION_RULES.md:
| Quality | Criteria |
|---|---|
| present | 400+ words, structured headings, content addresses the concept |
| present_thin | Fewer than 200 words or generic boilerplate |
| present_stale | Not updated in 18+ months (check dates, copyright year, referenced events) |
| present_documents_only | Only PDF download links, no on-page narrative |
| present_external | Served via external platform |
| present_generic | Not configured for the expected audience or purpose |
Record staleness signals checked: copyright year, most recent results reference, CEO name currency, financial target currency.
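A partial sketch of the staleness scan. Only the copyright-year and results-reference checks are automated here; CEO-name and financial-target currency need reference data this sketch does not have, and a gap of 2+ calendar years is used as a coarse proxy for the 18-month rule.

```python
import re
from datetime import date

def staleness_signals(text, today=None):
    """Scan page text for a subset of the staleness signals listed above."""
    today = today or date.today()
    signals = []
    # Copyright year: flag if the newest year found is 2+ years behind.
    years = [int(y) for y in re.findall(r"(?:©|\(c\)|copyright)\s*(\d{4})", text, re.I)]
    if years and today.year - max(years) >= 2:
        signals.append(f"copyright_year_stale:{max(years)}")
    # Most recent results reference, e.g. "FY2019", "H1 2020", "Q3 2021".
    refs = [int(y) for y in re.findall(r"\b(?:FY|H1|H2|Q[1-4])\s?(\d{4})\b", text)]
    if refs and today.year - max(refs) >= 2:
        signals.append(f"results_reference_stale:{max(refs)}")
    return signals
```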
Read references/DOCUMENT_CHECKLIST.md before running this step.
The checklist covers 130 document types across financial reporting, investor communications, sustainability, corporate communications, governance, and more. Each item is tagged with a skill assignment — check only items tagged WR (website-research) or BOTH.
Priority pass — always run (regardless of audit depth):
Check for the following WR/BOTH items from the document inventory found during scraping. These are the highest-value documents for the findings stage:
| # | Document | Where to look |
|---|---|---|
| 1 | Annual report and accounts | IR section, PDF links |
| 3 | Prelims presentation slides | IR section, results page |
| 6 | Half-year / interim results | IR section |
| 7 | Half-year presentation slides | IR section |
| 11 | Capital markets day presentation | IR section, events |
| 13 | Investor day materials | IR section |
| 14 | Strategy update presentation | Strategy / IR section |
| 17 | Investor factsheet / factbook | IR section |
| 19 | AGM presentation | Governance / IR section |
| 32 | Sustainability report / ESG report | Sustainability section, PDF links |
| 33 | TCFD report | Sustainability section |
| 35 | Net zero transition plan | Sustainability section |
| 36 | Modern slavery statement | Footer, governance, sustainability |
| 37 | Gender pay gap report | Careers, governance section |
For each item, record:
```json
{
  "item_id": 1,
  "document": "Annual report and accounts",
  "status": "present | present_partial | absent | not_assessed",
  "url": "",
  "notes": ""
}
```
Extended pass — run for Tier 2 and Tier 3 audits:
Check all remaining WR and BOTH items from DOCUMENT_CHECKLIST.md (items tagged CR are not assessed at this stage — those are company-research scope). Record all checked items in the document_checklist section of the manifest.
Do not extract document contents in this stage — record presence and URL only. Document extraction is ms-brief's responsibility with consultant approval.
After scraping, run the negative verification procedure for every concept in references/NEGATIVE_VERIFICATION_CONCEPTS.md.
For each concept, follow the three-step procedure in that file:
Only after all three steps fail may the concept be recorded as absent.
Record each concept result in the evidence manifest:
```json
{
  "concept": "",
  "status": "present | present_thin | present_stale | present_documents_only | present_external | present_generic | absent | not_assessed",
  "verified_by": "path_match | direct_probe | site_search | not_run",
  "url": "",
  "checked_paths": [],
  "search_query": "",
  "notes": ""
}
```
If coverage confidence is low, record all concepts as not_assessed rather than absent — insufficient coverage means absence cannot be confirmed.
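The flow can be sketched as follows. The three step functions are caller-supplied with hypothetical signatures; the authoritative step definitions live in references/NEGATIVE_VERIFICATION_CONCEPTS.md. Each step returns a URL when the concept is found, else None.

```python
def verify_absence(concept, path_match, direct_probe, site_search, coverage):
    """Three-step negative verification sketch for one concept."""
    if coverage == "low":
        # Insufficient coverage: absence cannot be confirmed.
        return {"concept": concept, "status": "not_assessed", "verified_by": "not_run"}
    for step_name, step in (("path_match", path_match),
                            ("direct_probe", direct_probe),
                            ("site_search", site_search)):
        url = step(concept)
        if url:
            return {"concept": concept, "status": "present",
                    "verified_by": step_name, "url": url}
    # Only after all three steps fail may the concept be recorded as absent.
    return {"concept": concept, "status": "absent",
            "verified_by": "site_search", "url": ""}
```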
After completing the crawl, build two outputs from the classified URL inventory and scraped content.
Record the classified section inventory in the evidence manifest. For each section key, record:
```json
{
  "section_key": "",
  "urls_discovered": 0,
  "pages_scraped": 0,
  "presence_quality": "",
  "scraped_pages": [
    {
      "url": "",
      "page_title": "",
      "section_key": "",
      "playbook_page_type": "",
      "classification_rule": "",
      "classification_confidence": "high | medium",
      "presence_quality": "",
      "word_count": 0,
      "content_summary": "",
      "key_observations": [],
      "documents_linked": [],
      "staleness_signals": []
    }
  ]
}
```
Build a nested tree representing the site's confirmed structure. Save as {company}-ms-crawl-structure.json.
This file is consumed by the HTML and Excel renderers. Shape:
```json
{
  "name": "root",
  "label": "{domain}",
  "children": [
    {
      "name": "{section_key}",
      "label": "{display name — e.g. 'Investors'}",
      "url": "{section landing page URL}",
      "description": "{one-sentence content summary}",
      "presence_quality": "present | present_thin | present_stale | ...",
      "word_count": 0,
      "children": [
        {
          "name": "{sub-section key}",
          "label": "{display name}",
          "url": "{URL}",
          "description": "",
          "presence_quality": "",
          "word_count": 0
        }
      ]
    }
  ]
}
```
Display names — use plain English labels, not snake_case keys:
| section_key | label |
|---|---|
| homepage | Home |
| investor_relations | Investors |
| investment_case | Investment case |
| financial_results | Results |
| annual_report | Annual report |
| governance | Governance |
| esg_sustainability | Sustainability |
| responsible_ai | Responsible AI |
| careers | Careers |
| employer_brand | Life at [Company] |
| news_media | Newsroom |
| about | About |
| strategy | Strategy |
| leadership | Leadership |
| contact | Contact |
Include only sections with presence_quality that is not absent. Absent sections are recorded in the evidence manifest's negative verification results — they do not appear as nodes in the site structure tree.
Include sub-pages as children where they were discovered and scraped. Leaf nodes with no children omit the children field.
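Tree assembly under these rules might look like the following sketch, applied to section records collected earlier (absent sections dropped, empty children fields omitted on leaves).

```python
def build_tree(domain, sections):
    """Assemble the site-structure tree from classified section records.

    sections: list of dicts with name, label, url, description,
    presence_quality, word_count, and an optional 'children' list.
    """
    def node(s):
        n = {k: s[k] for k in ("name", "label", "url", "description",
                               "presence_quality", "word_count")}
        kids = [node(c) for c in s.get("children", [])
                if c["presence_quality"] != "absent"]
        if kids:
            n["children"] = kids   # leaf nodes omit the children field
        return n

    return {"name": "root", "label": domain,
            "children": [node(s) for s in sections
                         if s["presence_quality"] != "absent"]}
```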
Save both files on completion:
- `{company}-ms-crawl-manifest.json` — complete evidence manifest (structure below)
- `{company}-ms-crawl-structure.json` — hierarchical site tree

Evidence manifest structure:
```json
{
  "stage": "ms-crawl",
  "company": "",
  "domain": "",
  "crawled_at": "",
  "crawl_summary": {},
  "section_inventory": {},
  "negative_verification": {},
  "subdomains": [],
  "careers_platform": "",
  "document_checklist": {
    "priority_pass_complete": true,
    "extended_pass_complete": false,
    "items_checked": 0,
    "items": []
  },
  "evidence_gaps": []
}
```
evidence_gaps: record every condition where data could not be collected — blocked pages, robots.txt exclusions, JavaScript-rendered content, timeouts. These feed directly into the ms-findings limitations section.
If subdomains are found (e.g. careers.{domain}, investors.{domain}):
- Run firecrawl_map on the subdomain root only — do not deep-crawl it.
- Record them in crawl_summary.subdomains_found.

| Condition | Response |
|---|---|
| sitemap.xml absent | Proceed with Sources 2–4; note in crawl_summary |
| Homepage scrape fails | Skip Source 2 nav extraction; note gap |
| firecrawl_map blocked | Rely on sitemap + nav; set coverage_confidence: low |
| Pagination not found | Record section depth as not_assessed |
| robots.txt blocks sections | Note restriction; do not assert absence |
| Scrape returns >60K chars | Flag nav bloat; check CRAWL_CONFIG; do not pass to reasoning phase |
| Subdomain blocks crawl | Note subdomain presence; record structure as not_assessed |
| All sources partially fail | Produce manifest with available evidence; set coverage_confidence: low; surface in evidence_gaps |
The crawl never fails silently. Every degradation is recorded in crawl_summary and surfaced as an evidence gap.
After saving the crawl output files, update the existing ms_analyses row if one was created by ms-brief, or create a new one. Use the bigquery connector (mcp__bigquery__run_query in Cowork). Best-effort — skip silently if unavailable.
Check for existing row:
```sql
SELECT analysis_id FROM sector_intelligence.ms_analyses
WHERE LOWER(company) = LOWER('{company}')
  AND analysis_type = 'ms_brief'
ORDER BY generated_at DESC
LIMIT 1
```
If row exists — update it with crawl data:
```sql
UPDATE sector_intelligence.ms_analyses
SET coverage_confidence = '{coverage_confidence}',
    pages_loaded = {total_pages_discovered},
    sections_assessed = ['{section_1}', '{section_2}', ...],
    evidence_gaps = ['{gap_1}', '{gap_2}', ...]
WHERE analysis_id = '{analysis_id}'
```
If no row exists — insert a new one with analysis_type = 'ms_crawl':
```sql
-- Generate an ID first:
SELECT GENERATE_UUID() AS analysis_id;

-- Then insert the new row:
INSERT INTO sector_intelligence.ms_analyses
  (analysis_id, company, domain, generated_at, analysis_type,
   coverage_confidence, pages_loaded, sections_assessed,
   evidence_gaps, loaded_at)
VALUES (
  '{analysis_id}',
  '{company}',
  '{domain}',
  CURRENT_TIMESTAMP(),
  'ms_crawl',
  '{coverage_confidence}',
  {total_pages_discovered},
  ['{section_1}', '{section_2}', ...],
  ['{gap_1}', '{gap_2}', ...],
  CURRENT_TIMESTAMP()
)
```
Do not block: If any query fails, proceed to Stage completion.
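The best-effort rule can be wrapped as a simple guard. The `run_query` callable stands in for the bigquery connector call and is an assumption of this sketch.

```python
def record_analysis(run_query, sql):
    """Best-effort BigQuery write: never block stage completion."""
    try:
        run_query(sql)
        return True
    except Exception:
        # Skip silently; the crawl output files are already on disk.
        return False
```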
After saving both output files, show a clean summary.
Do not show: raw JSON, criterion observations, findings, or recommendations.
Offer the next stage: /ms-findings