Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification. TRIGGERS - scrape research, archive findings, save ChatGPT share, save Gemini research, research to issue.
From gh-tools
npx claudepluginhub terrylica/cc-skills --plugin gh-tools
references/evolution-log.md
references/frontmatter-schema.md
references/url-routing.md
Scrape AI research conversations (ChatGPT, Gemini, Claude) and web pages, archive them as markdown files with YAML frontmatter, and create cross-referenced GitHub Issues — with mandatory identity verification at every step.
Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
MANDATORY: Select and load the appropriate template before any archival work.
Full workflow (scrape, archive, create Issue):
1. Identity preflight: verify GH_ACCOUNT or resolve via curl /user
2. Scrape URL: route to Firecrawl or Jina per url-routing.md
3. Save to file: YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
4. Survey labels: gh label list, reuse existing, 3 to 6 per issue
5. Create GitHub Issue: use --body with heredoc or --body-file
6. Update frontmatter: add github_issue_url and github_issue_number
7. Post canonical backlink comment on Issue

Scrape and archive only (no Issue):
1. Identity preflight (still required for consistency)
2. Scrape URL: route to Firecrawl or Jina per url-routing.md
3. Save to file: YYYY-MM-DD-{slug}-{source_type}.md with frontmatter

Issue from an existing archived file:
1. Identity preflight
2. Read existing file frontmatter
3. Survey labels: gh label list, reuse existing, 3 to 6 per issue
4. Create GitHub Issue: use --body with heredoc or --body-file
5. Update file frontmatter with issue cross-reference
6. Post canonical backlink comment on Issue
MUST execute before any gh write command. Non-negotiable.
The gh-repo-identity-guard.mjs PreToolUse hook provides a safety net, but this skill performs its own check as defense-in-depth.
- GH_ACCOUNT env var (set by mise per-directory)
- ~/.claude/.secrets/gh-token-* for single base match
- curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user

/usr/bin/env bash << 'IDENTITY_EOF'
# Resolve authenticated user
if [ -n "${GH_ACCOUNT:-}" ]; then
  AUTH_USER="$GH_ACCOUNT"
  AUTH_SOURCE="GH_ACCOUNT"
else
  # GitHub pretty-prints its JSON ("login": "name"), so allow whitespace after the colon
  AUTH_USER=$(curl -sf --max-time 5 -H "Authorization: token $GH_TOKEN" \
    https://api.github.com/user 2>/dev/null \
    | grep -o '"login":[[:space:]]*"[^"]*"' | head -n1 | cut -d'"' -f4)
  AUTH_SOURCE="API /user"
fi

# Resolve target repo owner
REPO_OWNER=$(git remote get-url origin 2>/dev/null | sed -n 's|.*github\.com[:/]\([^/]*\)/.*|\1|p')

echo "Authenticated as: $AUTH_USER (via $AUTH_SOURCE)"
echo "Target repo owner: $REPO_OWNER"

# An empty value on either side means resolution failed; never treat "" = "" as a match
if [ -z "$AUTH_USER" ] || [ -z "$REPO_OWNER" ]; then
  echo "UNRESOLVED: could not determine authenticated user or repo owner; do NOT proceed"
  exit 1
fi

if [ "$AUTH_USER" != "$REPO_OWNER" ]; then
  echo ""
  echo "MISMATCH: do NOT proceed with gh write commands"
  echo "Fix: export GH_TOKEN=\$(cat ~/.claude/.secrets/gh-token-$REPO_OWNER)"
  exit 1
fi
echo "Identity verified: safe to proceed"
IDENTITY_EOF
BLOCK if mismatch — display diagnostic and do NOT continue to any gh write operation.
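The owner comparison only works if the sed expression handles both HTTPS and SSH remote forms. A minimal sketch that exercises the same expression in isolation; the function name `repo_owner_from_url` is a hypothetical wrapper, not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed expression used in the preflight script.
# Prints the owner segment of a GitHub remote URL, or nothing if the URL is not on github.com.
repo_owner_from_url() {
  printf '%s\n' "$1" | sed -n 's|.*github\.com[:/]\([^/]*\)/.*|\1|p'
}

repo_owner_from_url "https://github.com/terrylica/cc-skills.git"   # prints: terrylica
repo_owner_from_url "git@github.com:terrylica/cc-skills.git"       # prints: terrylica
```

A remote hosted elsewhere yields an empty string, which is why the preflight should treat an empty owner as a failure rather than as a match.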
Route scrape requests based on URL pattern. See url-routing.md for full details.
URL contains chatgpt.com/share/
  → Jina Reader (https://r.jina.ai/{URL})
  → Use curl (not WebFetch, which summarizes instead of returning the raw content)
URL contains gemini.google.com/share/
  → Firecrawl (JS-heavy SPA)
  → Preflight: ping -c1 -W2 172.25.236.1
URL contains claude.ai/artifacts/ or is a static web page
  → Jina Reader (https://r.jina.ai/{URL})
  → Use WebFetch or curl
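The routing rules above can be sketched as a single case statement. The function name `route_scraper` and the labels it prints are assumptions for illustration; url-routing.md remains the authoritative reference.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a URL to the scraper named in the routing rules.
# The printed labels (jina-curl, firecrawl, jina) are illustrative, not defined by the skill.
route_scraper() {
  local url="$1"
  case "$url" in
    *chatgpt.com/share/*)       echo "jina-curl" ;;  # Jina Reader, fetched with curl only
    *gemini.google.com/share/*) echo "firecrawl" ;;  # JS-heavy SPA, needs a real browser
    *)                          echo "jina" ;;       # claude.ai artifacts and static pages
  esac
}

route_scraper "https://gemini.google.com/share/abc123"   # prints: firecrawl
```

Keeping the decision in one place means the Firecrawl health check only has to run when the function prints `firecrawl`.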
CRITICAL: Firecrawl containers can show "Up" in docker ps while internal processes are dead (RAM/CPU overload crashes the worker inside the container). Always perform a deep health check before scraping.
/usr/bin/env bash << 'SCRAPE_EOF'
set -euo pipefail
# URL and SLUG must be exported by the caller; the quoted heredoc expands nothing up front
: "${URL:?URL must be set}" "${SLUG:?SLUG must be set}"

# Print the HTTP status of a test scrape against the Firecrawl API ("000" if nothing answered).
# ssh -n: this script is read from stdin, and ssh would otherwise swallow the rest of it.
# No curl -f and no "|| echo 000": with -f, a 502 prints "502" and then fails, so the
# fallback would append to it ("502000") and the unhealthy check would never match.
firecrawl_status() {
  local code
  code=$(ssh -n littleblack 'curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST http://localhost:3002/v1/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || true)
  echo "${code:-000}"
}

is_unhealthy() {
  [ "$1" = "000" ] || [ "$1" = "502" ] || [ "$1" = "503" ]
}

# Step 1: Check ZeroTier connectivity
if ! ping -c1 -W2 172.25.236.1 >/dev/null 2>&1; then
  echo "ERROR: Firecrawl host unreachable. Check ZeroTier: zerotier-cli status"
  exit 1
fi

# Step 2: Deep health check: test actual API response, not just container status
# Port 3003 (wrapper) may accept TCP but return empty if Firecrawl API (3002) is dead inside
HTTP_CODE=$(firecrawl_status)
if is_unhealthy "$HTTP_CODE"; then
  echo "WARNING: Firecrawl API unhealthy (HTTP $HTTP_CODE). Attempting revival..."
  # Step 2a: Check docker logs for WORKER STALLED (RAM/CPU overload)
  ssh -n littleblack 'docker logs firecrawl-api-1 --tail 20 2>&1 | grep -i "stalled\|error\|exit" || true'
  # Step 2b: Restart the critical containers
  ssh -n littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1' 2>/dev/null
  echo "Containers restarted. Waiting 20s for API to initialize..."
  sleep 20
  # Step 2c: Verify recovery
  HTTP_CODE=$(firecrawl_status)
  if is_unhealthy "$HTTP_CODE"; then
    echo "ERROR: Firecrawl still unhealthy after restart (HTTP $HTTP_CODE)."
    echo "Manual intervention needed. Try: ssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'"
    echo "Falling back to Jina Reader: https://r.jina.ai/${URL}"
    exit 1
  fi
  echo "Firecrawl recovered successfully."
fi

# Step 3: Scrape via wrapper (-G with --data-urlencode keeps ?, & and # inside URL intact)
CONTENT=$(curl -s --max-time 120 -G "http://172.25.236.1:3003/scrape" \
  --data-urlencode "url=${URL}" --data-urlencode "name=${SLUG}" || true)
if [ -z "$CONTENT" ]; then
  echo "ERROR: Scrape returned empty. Try Jina fallback: https://r.jina.ai/${URL}"
  exit 1
fi
echo "$CONTENT"
SCRAPE_EOF
Symptom: docker ps shows containers with status "Up 4 days" but curl localhost:3002 returns connection reset.
Root cause: Firecrawl worker exhausts RAM/CPU (observed: cpuUsage=0.998, memoryUsage=0.858). Internal Node.js processes exit but Docker container stays alive because the entrypoint shell is still running.
Diagnosis:
ssh littleblack 'docker logs firecrawl-api-1 --tail 50 2>&1 | grep -E "STALLED|cpuUsage|exit"'
# Look for: WORKER STALLED {"cpuUsage":0.998,"memoryUsage":0.858}
Fix: docker restart (not docker compose restart — may require permissions to compose directory):
ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1'
sleep 20 # Wait for API initialization
# Verify:
ssh littleblack 'curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/v1/scrape'
YYYY-MM-DD-{slug}-{source_type}.md
- slug: kebab-case summary (max 50 chars)
- source_type: one of chatgpt, gemini, claude, web

Default location: docs/research/ in the current project.
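The naming convention can be sketched as a small helper that derives the slug from a free-text title. The function name `archive_filename` is hypothetical, and using the UTC date for the prefix is an assumption; the skill only specifies the YYYY-MM-DD-{slug}-{source_type}.md pattern and the 50-character cap.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the archive filename from a title and a source type.
# Assumption: the date prefix is today's date in UTC.
archive_filename() {
  local title="$1" source_type="$2" slug
  # Lowercase, collapse each run of non-alphanumerics to one hyphen,
  # trim leading/trailing hyphens, cap at 50 chars, then trim a hyphen left by the cut.
  slug=$(printf '%s' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -c1-50 \
    | sed -E 's/-+$//')
  printf '%s-%s-%s.md\n' "$(date -u +%Y-%m-%d)" "$slug" "$source_type"
}

archive_filename "Transformer scaling notes" chatgpt
```

The result is a bare filename; prefix it with docs/research/ (or another directory) when writing the file.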
See frontmatter-schema.md for the full field contract.
---
source_url: https://chatgpt.com/share/...
source_type: chatgpt-share
scraped_at: "2026-02-09T18:30:00Z"
model_name: gpt-4o
custom_gpt_name: Cosmo
claude_code_uuid: SESSION_UUID
github_issue_url: ""
github_issue_number: ""
---
Leave github_issue_url and github_issue_number empty — update after Issue creation.
Survey existing labels first — reuse preferred, create only when concept is genuinely novel.
gh label list --repo owner/repo --limit 100
Policy: 3 to 6 labels per issue. Common labels: research, ai-output, chatgpt, gemini, archival.
Use --body with heredoc for inline composition, or --body-file for very large content.
/usr/bin/env bash << 'ISSUE_EOF'
# Write body to temp file
cat > "/tmp/issue-body-${SLUG}.md" << 'BODY_EOF'
## Summary

Brief description of the archived research content.

## Source

- **URL**: SOURCE_URL
- **Type**: source_type
- **Model**: model_name
- **Scraped**: scraped_at

## Key Findings

- Finding 1
- Finding 2

## Archived File

`docs/research/FILENAME.md`
BODY_EOF

# Create issue
gh issue create \
  --repo owner/repo \
  --title "Research: descriptive title here" \
  --body-file "/tmp/issue-body-${SLUG}.md" \
  --label "research,ai-output"

# Clean up
rm -f "/tmp/issue-body-${SLUG}.md"
ISSUE_EOF
After issue creation, update the archived file's frontmatter with the issue URL and number.
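One way to apply that update is a sed rewrite of the two empty fields. This is a sketch under stated assumptions: the function name `update_issue_fields` is hypothetical, the fields are expected to appear exactly as empty quoted strings, and writing the number as a bare integer is a guess that frontmatter-schema.md may override.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the empty cross-reference fields in an archived file.
# Assumes both fields are present as `key: ""`, matching the frontmatter example.
update_issue_fields() {
  local file="$1" issue_url="$2" issue_number tmp
  issue_number="${issue_url##*/}"   # the number is the last path segment of the issue URL
  tmp=$(mktemp)
  sed -e "s|^github_issue_url: \"\"\$|github_issue_url: \"${issue_url}\"|" \
      -e "s|^github_issue_number: \"\"\$|github_issue_number: ${issue_number}|" \
      "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage (not run here): gh issue create prints the new issue URL on stdout.
#   ISSUE_URL=$(gh issue create --repo owner/repo --title "..." --body-file body.md)
#   update_issue_fields "docs/research/FILENAME.md" "$ISSUE_URL"
```

Anchoring both patterns to the full line keeps the rewrite from touching body text that happens to mention the field names.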
Post a comment on the Issue linking back to the archived file:
**Archived**: `docs/research/YYYY-MM-DD-slug-source_type.md`
Scraped: 2026-02-09T18:30:00Z
Source: [chatgpt-share](https://chatgpt.com/share/...)
Session: SESSION_UUID
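The comment template above can be generated from the same values that went into the frontmatter. A minimal sketch; `backlink_comment` is a hypothetical name, and the usage lines assume the issue number is already known.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the canonical backlink comment for an archived file.
# Arguments: file path, scraped_at timestamp, source type, source URL, session UUID.
backlink_comment() {
  local file="$1" scraped_at="$2" source_type="$3" source_url="$4" session="$5"
  printf '**Archived**: `%s`\n' "$file"
  printf 'Scraped: %s\n' "$scraped_at"
  printf 'Source: [%s](%s)\n' "$source_type" "$source_url"
  printf 'Session: %s\n' "$session"
}

# Usage (not run here; ISSUE_NUMBER and owner/repo are placeholders):
#   backlink_comment "docs/research/FILENAME.md" "$SCRAPED_AT" "$SOURCE_TYPE" "$SOURCE_URL" "$SESSION_UUID" \
#     | gh issue comment "$ISSUE_NUMBER" --repo owner/repo --body-file -
```

Piping into `--body-file -` avoids shell quoting problems with the backticks and brackets in the comment text.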
After modifying THIS skill:
- ./references/ links resolve
- uv run plugins/plugin-dev/scripts/skill-creator/quick_validate.py plugins/gh-tools/skills/research-archival
- bun run plugins/plugin-dev/scripts/validate-links.ts plugins/gh-tools/skills/research-archival

| Issue | Cause | Fix |
|---|---|---|
| Wrong account posting | GH_TOKEN mismatch | Check mise env \| grep GH_TOKEN, verify GH_ACCOUNT |
| Body exceeds 65536 chars | GitHub API limit | Split across issue body + first comment |
| Firecrawl unreachable | ZeroTier down | ping 172.25.236.1, check zerotier-cli status |
| Firecrawl "Up" but dead | Container alive, processes crashed | docker restart firecrawl-api-1 firecrawl-playwright-service-1, wait 20s |
| Firecrawl WORKER STALLED | RAM/CPU overload (>85% mem) | Same as above; check docker logs firecrawl-api-1 --tail 50 |
| Scrape returns empty | JS-heavy page timeout | Increase Firecrawl timeout, try Jina fallback |
| Jina returns login page shell | Gemini login wall (not rendered) | Must use Firecrawl for gemini.google.com/share/* URLs |
| mise parse error | Stale .mise.toml syntax | Run mise doctor, check [hooks.enter] syntax |
| Identity guard blocks | Non-owner account | export GH_TOKEN=$(cat ~/.claude/.secrets/gh-token-OWNER) |
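
For the body-size row in the table above, a pre-check before calling gh can catch oversized bodies early. A minimal sketch: `body_fits` is a hypothetical name, 65536 is the limit quoted in the table, and `wc -m` counts characters according to the current locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file fits in a single issue body.
# The default limit is the figure from the troubleshooting table.
body_fits() {
  local file="$1" limit="${2:-65536}" chars
  chars=$(wc -m < "$file" | tr -d '[:space:]')
  [ "$chars" -le "$limit" ]
}

# Usage (not run here):
#   if ! body_fits "/tmp/issue-body-${SLUG}.md"; then
#     echo "Body too large: split across issue body + first comment"
#   fi
```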
After this skill completes, check before closing:
Only update if the issue is real and reproducible — not speculative.