Help us improve
Share bugs, ideas, or general feedback.
From akb-wiki
Ingests a local file or web URL into an AKB vault as a structured five-section LLM-wiki summary, optionally preserving original binaries in raw file storage. One source per invocation.
npx claudepluginhub dnotitia/akb --plugin akb-wikiHow this agent operates — its isolation, permissions, and tool access model
Agent reference
akb-wiki:agents/ingest-docsonnetThe summary Claude sees when deciding whether to delegate to this agent
Ingest **one document** (local path or URL) into the target AKB vault as a five-section summary page, optionally preserving the original bytes in AKB's raw file layer. This skill is the `akb-wiki` plugin's ingest write path. **Workflow classification.** LLM-wiki **Ingest** workflow. **Intent-debt** layer — it externalizes documents of record (papers, articles, manuals, transcripts, web clips) i...
Fetches one Confluence page via the Atlassian MCP server and writes it into an AKB vault as a structured five-section LLM-wiki summary document.
Passive agent that captures verbatim user-provided sources (URLs, PDFs, GitHub links, text, session logs) into raw/ and wiki/sources/ with SHA-256 provenance. No interpretation.
Knowledge management agent that ingests content from web, files, Notion, images and exports structured notes to Obsidian, Notion, Markdown, PDF, blogs.
Share bugs, ideas, or general feedback.
Ingest one document (local path or URL) into the target AKB vault as a five-section summary page, optionally preserving the original bytes in AKB's raw file layer. This skill is the akb-wiki plugin's ingest write path.
Workflow classification. LLM-wiki Ingest workflow. Intent-debt layer — it externalizes documents of record (papers, articles, manuals, transcripts, web clips) into the vault so the rationale and constraints in those sources stay retrievable by humans and agents.
Scope boundaries:
akb_put_file and emit a summary document with a derived_from link to the raw file. Plaintext formats emit only the summary.corpus-summaries, raw files in corpus-raw. No --collection override — use tags for sub-organization./ingest-doc <source> --vault {vault_name} [--raw | --no-raw] [--lang {language}]
<source> (required) — local path (absolute, or glob like ./papers/*.pdf) or URL (http/https).--vault (required) — target AKB vault name.--raw / --no-raw (optional) — override format-based raw-upload default (see Step 2).--lang (optional) — language for synthesized note body (TL;DR / Key Points / topic tag wording). If omitted, infer from the source's primary language. Use this to override when the source language differs from the desired note language (e.g., English paper → Korean summary).--vault {name} provided.WebFetch available.Parse --vault from arguments. Hard-fail if missing: --vault {name} required.
Parse remaining CLI arguments. Capture today's date as today (ISO YYYY-MM-DD).
Classify <source>:
/, ./, ~/, or has no scheme. If it contains glob metacharacters (*, ?, [), resolve with Glob and loop through each match, running Steps 3–7 per file. A zero-match glob fails with No files matched "{pattern}".http:// or https://.Detect format from extension (path) or Content-Type header when available (URL), plus filename hint from the URL path:
| Format | Extension set | Upload default |
|---|---|---|
pdf | .pdf | binary → raw |
image | .png, .jpg, .jpeg, .gif, .webp | binary → raw |
office | .docx, .xlsx, .pptx | binary → raw |
markdown | .md, .markdown | plaintext → summary only |
html | .html, .htm, URL with no extension | plaintext → summary only |
text | .txt, .log | plaintext → summary only |
json | .json, .jsonl, .yaml, .yml | plaintext → summary only |
Unsupported extensions fail with Unsupported format: {ext}. Supported: pdf, png/jpg/gif/webp, docx/xlsx/pptx, md, html, txt, json.
Apply flag overrides:
--raw → force raw upload regardless of format default (useful when the user wants to preserve a .md file's original bytes).--no-raw → suppress raw upload regardless of format default (useful when the user considers a PDF disposable and only wants the extracted text).Record raw_upload ∈ {true, false} and format_kind ∈ {binary, plaintext} for downstream steps.
Always obtain extracted text for the ## Content section. Additionally, when raw_upload == true, obtain a local file path pointing at the original bytes for akb_put_file.
Read(file_path="{path}")
pages="1-20" at a time). If the PDF's total page count is unknown, attempt without pages first; chunk only when the single read fails or exceeds token budget.Read cannot extract (e.g. .docx, .xlsx), surface the failure: Cannot extract text from {format}; install a converter or supply an already-converted .md. The raw bytes can still be uploaded if raw_upload == true — ingest proceeds with an empty ## Content and a note "text extraction unsupported for this format."local_file_path is the input path itself.WebFetch(url="{url}", prompt="Return the main article content as clean markdown. Preserve headings, lists, and code blocks. Drop navigation, ads, footers.")
If raw_upload == true, also download the URL bytes to a temp path (/tmp/ingest-doc-{short_hash}.{ext}) for akb_put_file:
raw_upload=true (only happens when --raw is explicit): save the WebFetch'd HTML/markdown verbatim to temp.Bash(curl -sL -o /tmp/... {url}) — note this in the output if taken.)Record extracted_text (may be empty) and local_file_path (may be null).
Two-stage deterministic / semantic check, then a fully automatic decision table — no caller flag, no interactive prompt. The skill absorbs the dedup decision (LLM-wiki agent absorbs bookkeeping).
Compute new_hash = sha256(extracted_text) once up front — used as both the path-source lookup key and the equality test in the decision table.
Build a lookup_key: canonical URL (strip tracking params like utm_*, ref, fbclid) for URL sources, new_hash for path sources. Query akb_search(vault="{vault_name}", query={lookup_key}, collection="corpus-summaries", type="reference", limit=5).
For each hit, akb_get(vault="{vault_name}", doc_id={hit.doc_id}) and inspect the ## Source block. Match on either a Location: line containing the canonical URL or a Content-SHA256: line matching new_hash (only present when a plaintext path was previously ingested). On match, set dedup_hit = doc_id, parse existing_hash from the existing doc's Content-SHA256: (may be absent if the prior ingest had no raw upload), set dedup_kind = "deterministic", and skip Stage 2.
Runs only when Stage 1 produced no hit. Produce a provisional title (URL <title> tag, PDF metadata, markdown # heading, or filename stem — whichever is available). Query akb_search(vault="{vault_name}", query={provisional_title}, collection="corpus-summaries", type="reference", limit=5). Any hit with score > 0.85 counts as fuzzy: set dedup_hit = top_hit.doc_id, dedup_kind = "fuzzy". The hit is a candidate, not a confirmed identity match.
dedup_hit / kind | Comparison | Action |
|---|---|---|
| none | — | continue to Step 5 (create path) |
| deterministic | existing_hash == new_hash | return ## ingest-doc: unchanged (see Step 7) and stop. Content identical — no compiled-truth refresh, no timeline append. |
| deterministic | existing_hash != new_hash (or existing_hash absent) | continue to Step 5; Step 6 uses akb_update on dedup_hit (compiled-truth refresh + timeline append). |
| fuzzy | — | return ## ingest-doc: likely-exists and stop. The fuzzy hit is a candidate, not a confirmed identity match — leave disambiguation to the external maintenance layer or human review. |
Edge cases (forced create on a fuzzy false-positive, forced replace despite hash equality) are out of scope for this skill — handle via a separate disambiguation workflow rather than a per-call flag. This keeps the ingest path headless-safe and the skill the sole owner of the dedup decision.
Skip when raw_upload == false. Otherwise: akb_put_file(vault="{vault_name}", file_path={local_file_path}, collection="corpus-raw", description={provisional_title}). Capture file_id and s3_key from the response; compose Raw URI = akb://{vault_name}/file/{file_id}. If local_file_path was a temp file (URL download), delete it after the upload succeeds: Bash(rm -f {local_file_path}).
Compose the summary document body using the fixed five-section structure. The section headings are fixed; the content inside each section adapts to the source.
Language: write TL;DR / Key Points / topic tag descriptors in --lang if it was provided; otherwise match the source's primary prose language (not embedded code/quotes). Section headings, frontmatter keys, and the ## Content zone (mechanical extracted text) always stay in their natural form — headings/keys English, content as-extracted.
# {title}
## TL;DR
{Agent-authored 1–3 sentences. What is this document, at a synthesis level. No hedging, no "this document discusses..." boilerplate.}
## Key Points
- {Bullet 1 — the sharpest 3–7 takeaways. For a paper: contribution, method, result. For an article: thesis, argument, conclusion. For a manual: what problem it solves, prerequisites, caveat. For a transcript: decisions, actions, open questions.}
- {Bullet 2}
- ...
## Content
{Cleaned extracted text from Step 3. Preserve original heading hierarchy, code blocks, tables. Drop navigation/ads/footers. For binary sources with no text extraction, write: "Text extraction unsupported for this format. See raw file at {raw_uri}."}
## Source
- Location: {source URL or absolute path}
- Raw: akb://{vault_name}/file/{file_id} ← omit this line entirely when `raw_upload == false`
- Format: {format (pdf, html, md, …)}
- Content-SHA256: {hash of extracted_text, for future dedup}
- Ingested: {today}
---
## Timeline
- **{today}** | Ingested from {source}. [Source: ingest-doc, {today}]
Use only AKB standard fields:
type: reference
status: active
title: "{title}"
summary: "{one-sentence synthesis, same voice as TL;DR but compressed to one line}"
tags: ["{format_kind}", "{format}", "topic:{primary}", "topic:{secondary}"]
depends_on: []
related_to: []
implements: []
format_kind is binary or plaintext; format is the specific extension (pdf, md, …). Both are derived from the source.topic:{...} entries are 1–3 LLM-derived topic tags, authored alongside TL;DR / Key Points. After synthesizing the body, identify the 1–3 sharpest subjects of the document — e.g., a paper on retrieval-augmented generation gets topic:rag, topic:retrieval. Rules: lowercase, hyphenated for compounds (topic:vector-search), prefer domain-anchored vocabulary over generic words (introduction / overview / tutorial / notes are not topics), drop a tag rather than reach for filler. Empty topic list is allowed only when the source has no extractable subject (extraction failed, content is a directory listing or single-line note); downstream consolidation will then silently skip the summary.tags array, no caller-supplied tags.depends_on / related_to / implements are empty at ingest time — relation-wiring is the external maintenance layer's job.Branch by Step 4 decision.
New document (no hit). akb_put(vault="{vault_name}", collection="corpus-summaries", title={title}, type="reference", tags={tags}, summary={one_sentence_summary}, content={assembled_body}). Capture doc_id and doc_path.
Replace (deterministic hit with hash differing / absent):
akb_get(vault="{vault_name}", doc_id={dedup_hit}) → existing_content.new_entry = "- **{today}** | Re-ingested from {source}. [Source: ingest-doc, {today}]" and existing_content as-is.content: everything from the newly composed body above the --- + ## Timeline line, then the merged timeline from Mode B. This replaces the compiled-truth zone (TL;DR / Key Points / Content / Source) while preserving all historical timeline entries.akb_update(vault="{vault_name}", doc_id={dedup_hit}, content={merged_content}, message="Re-ingested from {source}").When raw_upload == true and a doc was created or replaced, materialize the lineage edge: akb_link(vault="{vault_name}", source="akb://{vault_name}/doc/{doc_path}", target="akb://{vault_name}/file/{file_id}", relation="derived_from").
On the replace path, first check akb_relations(vault="{vault_name}", resource_uri="akb://{vault_name}/doc/{doc_path}", type="derived_from", direction="outgoing") — the raw file_id may have changed or the prior ingest may have used --no-raw. Skip akb_link idempotently when the edge to the new file_id already exists. If a derived_from edge points to a different file/{id}, the prior raw is orphaned — do not delete it (other docs may reference it); leave it and emit a warning in the output.
Emit one of four single-source report blocks based on Step 4 + Step 6 outcome:
## ingest-doc: created
- title: {title}
- doc_id: {doc_id}
- path: {doc_path}
- vault: {vault_name}
- collection: corpus-summaries
- raw_file_id: {file_id_or_none}
- summary: {one_sentence_summary}
## ingest-doc: replaced
- title: {title}
- doc_id: {dedup_hit}
- vault: {vault_name}
- previous raw: {old_file_id_or_none}
- new raw: {file_id_or_none}
- timeline entries: {N (after append)}
## ingest-doc: unchanged
- source: {source}
- doc_id: {dedup_hit}
- vault: {vault_name}
## ingest-doc: likely-exists
- source: {source}
- top match: {dedup_hit} (score: {score})
- vault: {vault_name}
Glob loop: emit one block per source processed, then a final summary ## ingest-doc: batch complete with created / replaced / unchanged / likely-exists / failed counts.
| Situation | Response |
|---|---|
<source> missing | Missing required argument: <source> |
| Path not found | File not found: {path} |
| Glob matched zero files | No files matched "{pattern}". |
| Unsupported format | Unsupported format: {ext}. Supported: pdf, png/jpg/gif/webp, docx/xlsx/pptx, md, html, txt, json. |
| URL unreachable / 4xx / 5xx | Could not fetch {url}: {status} |
| Text extraction failed on binary | Record with ## Content = "Text extraction unsupported for this format. See raw file at {raw_uri}." Not a hard failure when raw_upload == true. Hard failure when raw_upload == false. |
| AKB write fails | Surface the error verbatim; do not retry. Partially-uploaded raw file is not rolled back at MVP — log orphaned raw file: {file_id} in the output so the user can akb_delete_file manually. |
--vault not passed | --vault {name} required. |
| Vault does not exist | Vault "{vault_name}" not found. Create with mcp__akb__akb_create_vault first. |
| AKB MCP server unreachable | AKB MCP server not accessible. Check your MCP configuration. |
Do not attempt to create vaults or collections. If the supplied vault_name does not exist, akb_put/akb_put_file will error with a clear message — surface it as-is and instruct the user to create the vault with mcp__akb__akb_create_vault.
Shared procedure for every skill that writes back to an existing note. Two modes — every caller picks one explicitly.
Shared helper. split_body(content) — locate the first standalone --- line that is followed (eventually) by a ## Timeline heading. Return (body_above, timeline_entries). If no ## Timeline heading exists, return (content, []).
Used when the caller has produced a rewritten compiled-truth body (the new canonical state) plus exactly one new timeline entry. Callers: session-ingest Phase 4-2 (session update), session-ingest Phase 4-3 (sub-note append), and ingest-jira (issue upsert).
Inputs: draft_body (full new body — includes --- + ## Timeline + one new bullet), existing_content (current doc).
(draft_above, draft_entries) = split_body(draft_body). draft_entries must contain exactly one bullet. If it contains zero, treat as drafter error → fall back to akb_update with draft_body unchanged and log a warning.(_, existing_entries) = split_body(existing_content).{draft_above}
---
## Timeline
{draft_entries[0]}
{existing_entries}
Used when the caller produced only a new timeline bullet and wants the existing compiled-truth zone preserved verbatim. Callers: session-ingest Phase 4-4 (task resolve) and the decision supersede path, ingest-doc re-ingest, and ingest-confluence re-ingest.
Inputs: new_entry (a single bullet line, possibly with sub-bullets), existing_content.
(existing_above, existing_entries) = split_body(existing_content).existing_content had no ## Timeline section, append "\n\n---\n\n## Timeline\n\n" to existing_above before proceeding.{existing_above}
---
## Timeline
{new_entry}
{existing_entries}