Converts raw files (PDF, docx, images, audio, etc.) into a local Markdown vault with retrieval-friendly frontmatter, then answers questions over it with self-monitoring and MOC proposals.
Slash command: /market-research-skills:local-vault
Turn a folder of raw files into a Markdown vault that an LLM can grep, and then answer questions over that vault responsibly.
Mental model: SOURCE = raw files (source of truth). VAULT = one .md per
source file, carrying retrieval frontmatter (abstract / tags / synonyms) + a
source backlink. The vault is the layer the LLM reads; the raw files are where
the user goes to verify.
There are two distinct jobs; figure out which the user wants:
- Convert raw files into the vault (scripts/sync.py).
- Answer questions over the vault.

Prerequisites for scripts/sync.py:
- Python packages: python3 -m pip install --user requests python-dotenv pypdf pymupdf4llm openpyxl python-pptx
- pandoc: brew install pandoc (macOS) / distro pkg.
- ffmpeg (audio/video only): brew install ffmpeg (macOS) / distro pkg. The whisper engine is auto-selected by platform (mlx-whisper on Apple Silicon with GPU, faster-whisper elsewhere on cross-platform CPU/CUDA) and auto-installed after the user consents at the first-run prompt (no manual pip needed). On that first run with audio/video present, the tool shows the model-size options (tiny ~75 MB / small ~480 MB / turbo ~1.6 GB / large-v3 ~3 GB) and lets the user pick or skip; the choice is saved to .env (KB_WHISPER_MODEL) so it never re-asks. Fully local: no token/quota; the model downloads once, then runs offline.
- claude CLI on PATH: the pipeline shells out to claude -p for frontmatter enrichment and PPT-image OCR. If absent, those steps are skipped (not fatal).

Setup:
- Interactive: the user runs python3 scripts/sync.py in a terminal. On first run (when paths aren't configured yet) it launches an interactive wizard: it asks for the raw-files folder + the vault folder (+ optional MinerU token), creates them, writes scripts/.env, and prints how to use the tool. Then they re-run to convert.
- Manual: copy scripts/.env.example → scripts/.env and set KB_SOURCE_DIR (raw files) and KB_TARGET_DIR (the Markdown vault), both absolute. MINERU_TOKEN is optional (only for legacy .doc/.ppt, scanned PDFs, and images; get one at https://mineru.net).
- Non-interactive: write scripts/.env directly (the wizard only fires on an interactive TTY, which a claude -p subprocess is not).

Run:
python3 scripts/sync.py
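For the manual or non-interactive setup path, writing scripts/.env might be sketched as below. This is only an illustration: the helper name render_env and the plain KEY=value layout are assumptions, and only the variable names KB_SOURCE_DIR, KB_TARGET_DIR, and MINERU_TOKEN come from the text above.

```python
from pathlib import Path


def render_env(source_dir: str, target_dir: str, mineru_token: str = "") -> str:
    """Return .env text (hypothetical helper); both folders must be absolute."""
    for name, value in (("KB_SOURCE_DIR", source_dir), ("KB_TARGET_DIR", target_dir)):
        if not Path(value).is_absolute():
            raise ValueError(f"{name} must be an absolute path, got {value!r}")
    lines = [f"KB_SOURCE_DIR={source_dir}", f"KB_TARGET_DIR={target_dir}"]
    if mineru_token:  # optional, so it is written only when provided
        lines.append(f"MINERU_TOKEN={mineru_token}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example paths are placeholders, not defaults of the tool.
    print(render_env("/data/kb/raw", "/data/kb/vault"), end="")
```

Rejecting relative paths up front mirrors the "both absolute" requirement, so a misconfigured file fails at setup rather than at the first sync.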
On macOS, the first run (wizard or any normal run) also drops a clickable
sync.command into the knowledge-base root — the parent of the SOURCE folder,
with the absolute path to sync.py baked in (tool and data live apart — under
/plugin install the script sits in ~/.claude/plugins/cache/…, far from the data
folders, so a relative launcher can't work). After that the daily loop is: drop
files into SOURCE → double-click sync.command → read the .md in VAULT. The
launcher is idempotent; a stale auto-generated copy left in the SOURCE folder by an
older version is removed automatically (a user-written one is never touched). If a
different sync.command already exists at the root, an interactive terminal
prompts update / skip; non-interactively, our own out-of-date launcher self-heals
silently while a user-customized one is left alone.
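The non-interactive launcher rules (idempotent, self-healing for the tool's own copy, hands-off for a user-written one) could be sketched roughly as follows. The marker string, the function names, and the launcher body are all assumptions made for illustration; the real sync.py may recognise its own launcher differently.

```python
import shlex
import stat
from pathlib import Path

MARKER = "# auto-generated by sync.py"  # hypothetical marker, not the tool's real one


def launcher_text(sync_py: Path) -> str:
    """Shell launcher with the absolute path to sync.py baked in."""
    quoted = shlex.quote(str(sync_py.resolve()))
    return f"#!/bin/bash\n{MARKER}\nexec python3 {quoted}\n"


def ensure_launcher(root: Path, sync_py: Path) -> str:
    """Create or refresh root/sync.command and report what happened."""
    target = root / "sync.command"
    wanted = launcher_text(sync_py)
    if target.exists():
        current = target.read_text(encoding="utf-8")
        if current == wanted:
            return "unchanged"  # idempotent: already up to date
        if MARKER not in current:
            return "skipped"    # user-written launcher is left alone
    # Missing, or our own out-of-date copy: (re)write it.
    target.write_text(wanted, encoding="utf-8")
    target.chmod(target.stat().st_mode | stat.S_IXUSR)  # make it executable
    return "written"
```

Baking in the resolved absolute path is what lets the launcher live next to the data while the script stays in the plugin cache.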
First terminal run with no config → the setup wizard (above). Once .env exists:
- Incremental: only source files with no matching .md in VAULT are processed. To force a re-convert, delete that .md first, then re-run.
- Orphans: when a source file is gone, its converted .md, together with its attachments/<stem>/ images, is moved to an orphaned/<date>/ folder (never hard-deleted, since the user may have added notes), and the now-empty attachments/ is pruned. User-written .md (no converter marker) is never touched.

| Type | Tool | Notes |
|---|---|---|
| .xlsx | openpyxl dual-read | per sheet: value grid (with A/B/C + row coords) + formulas list |
| .csv / .tsv | csv → Markdown table | truncates past CSV_MAX_ROWS |
| .pdf (digital) | pymupdf4llm | local, fast, no quota; if PYMUPDF4LLM_WRITE_IMAGES (default on), images ≥ PYMUPDF4LLM_IMAGE_SIZE_LIMIT (12% of page) → attachments/, then filtered by min-bytes + de-dup. If pymupdf4llm crashes (e.g. missing-font), a local plain-text pass is tried before MinerU |
| .pdf (scanned) | MinerU vlm (fallback) | triggered when chars/page is too low |
| .docx/.rtf/.odt/.epub | pandoc | images extracted to attachments/ |
| .html/.htm | pandoc (local) | style/class/id attrs + layout div/section/span stripped first, so only content survives; tables kept lossless. No MinerU/token needed |
| .pptx | python-pptx | title/body/tables/charts/notes + images; smart OCR (see below) |
| .md/.markdown/.txt | passthrough | copied verbatim; only frontmatter added, body untouched |
| .json/.yaml/.py/… | code passthrough | wrapped in a fenced code block + frontmatter |
| audio .mp3/.m4a/.wav/… + video .mp4/.mov/.m4v | whisper (local; engine auto-selected: mlx-whisper on Apple Silicon, else faster-whisper) | speech-to-text, no token/quota; first run asks which model (shows sizes) + auto-installs the engine on consent (a model already cached on this machine is reused without re-asking); per-segment [mm:ss] timestamps + detected language; video = audio-track only (ffmpeg pulls it from the container). Needs ffmpeg; best on clear speech, since songs/music transcribe poorly |
| legacy .doc/.ppt, images | MinerU (cloud) | local libs can't read these |
| anything else (numbers/pages/zip/…) | skipped | reported at the end with a fix hint — never silently dropped |
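The incremental and orphan rules described before the table can be pictured as a pure planning step over path sets. The sketch below is an assumption-laden illustration: plan_sync is a hypothetical name, and it assumes a note's path is simply the source path with its extension replaced by .md, which may not match how sync.py actually names notes.

```python
from pathlib import PurePosixPath


def plan_sync(source_files, vault_notes, converted_notes):
    """Decide what to convert and what to orphan.

    source_files:    relative paths of raw files, e.g. "reports/q1.pdf"
    vault_notes:     set of relative paths of all .md files in the vault
    converted_notes: subset of vault_notes that carry the converter marker
    """
    # Assumed naming rule: same relative path, extension swapped for .md.
    expected = {str(PurePosixPath(f).with_suffix(".md")): f for f in source_files}
    # Convert only sources that have no note yet.
    to_convert = sorted(f for md, f in expected.items() if md not in vault_notes)
    # Orphan only tool-written notes whose source no longer exists.
    to_orphan = sorted(md for md in converted_notes if md not in expected)
    return to_convert, to_orphan
```

Because only notes in converted_notes are candidates for orphaning, a user-written .md can never end up in the orphan list.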
Images embedded in slides are OCR'd via claude -p (its Read tool reads the
image). To avoid spawning one slow claude per decorative logo, the pipeline
de-duplicates identical images (OCR once), skips images below
OCR_MIN_IMAGE_BYTES, and runs the unique content images concurrently
(OCR_MAX_WORKERS). Native PowerPoint chart objects are read directly
(categories + series values → a table). Set OCR_PPTX_IMAGES = False to turn OCR
off entirely (images are still extracted + referenced).
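The three OCR savings above (de-duplicate, skip small images, run concurrently) could be combined along these lines. This is a sketch, not the pipeline's code: ocr_images and the ocr_one callback are hypothetical, the two constants reuse the config names with made-up values, and the real tool calls claude -p where this sketch takes a plain function.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

OCR_MIN_IMAGE_BYTES = 5_000  # illustrative value, not the tool's default
OCR_MAX_WORKERS = 4          # illustrative value, not the tool's default


def ocr_images(images, ocr_one):
    """images: {name: bytes}; ocr_one: callable(bytes) -> str. Returns {name: text}."""
    by_digest = {}  # content hash -> names of images with identical bytes
    for name, data in images.items():
        if len(data) < OCR_MIN_IMAGE_BYTES:
            continue  # probably a logo or icon; not worth an OCR call
        by_digest.setdefault(hashlib.sha256(data).hexdigest(), []).append(name)

    # One OCR call per unique image, run in parallel.
    unique = {digest: images[names[0]] for digest, names in by_digest.items()}
    with ThreadPoolExecutor(max_workers=OCR_MAX_WORKERS) as pool:
        texts = dict(zip(unique, pool.map(ocr_one, unique.values())))

    # Fan the result back out to every name that shared the same bytes.
    return {name: texts[d] for d, names in by_digest.items() for name in names}
```

Hashing the bytes means a logo repeated on every slide costs at most one OCR call, and usually none because it falls under the size floor.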
Frontmatter of each converted .md:
---
source: "[[…/<file>.<ext>]]" # backlink to the raw file
source_type: pdf | xlsx | docx | pptx | md | …
converted_by: pymupdf4llm | pandoc | python-pptx | excel-openpyxl | csv | passthrough | whisper | "MinerU vlm" | …
# enrich (best-effort via claude -p, may be missing on failure):
abstract: |
  3-sentence summary.
auto_tags: [..]
synonyms: [English + Chinese synonyms] # so any phrasing greps the right doc
key_data: ["important numbers/facts"]
---
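Since the enrich fields may be missing when claude -p fails, a quick check for them can be handy. The following is a minimal sketch that scans only for top-level keys instead of parsing YAML; the function names are hypothetical and the key list is taken from the frontmatter shown above.

```python
import re

ENRICH_KEYS = ("abstract", "auto_tags", "synonyms", "key_data")


def frontmatter_keys(markdown: str) -> set:
    """Top-level keys of the leading frontmatter block delimited by --- lines."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        match = re.match(r"([A-Za-z_][\w-]*):", line)  # unindented "key:" only
        if match:
            keys.add(match.group(1))
    return set()  # no closing delimiter: treat as no frontmatter


def missing_enrich_fields(markdown: str) -> list:
    """Enrich keys that are absent, in the order listed in ENRICH_KEYS."""
    present = frontmatter_keys(markdown)
    return [key for key in ENRICH_KEYS if key not in present]
```

A note for which missing_enrich_fields returns "abstract" is a candidate for re-running enrichment.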
Settings (scripts/config.py):
- PYMUPDF4LLM_MIN_CHARS_PER_PAGE: scanned-PDF threshold
- PYMUPDF4LLM_WRITE_IMAGES: digital-PDF image extraction on/off (.env: KB_PDF_NO_IMAGES=1 to disable)
- PYMUPDF4LLM_IMAGE_SIZE_LIMIT: extraction floor as fraction of page area; default 0.12
- PYMUPDF4LLM_IMAGE_MIN_BYTES: drop images smaller than this; default 6000
- OCR_PPTX_IMAGES / OCR_MIN_IMAGE_BYTES / OCR_MAX_WORKERS: PPT image OCR
- EXCEL_MAX_CELLS_PER_SHEET
- CSV_MAX_ROWS
- ENRICH_FRONTMATTER
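To make the PDF image settings concrete, here is a rough sketch of the .env override and the post-extraction filter (min-bytes plus de-dup). The helper names are hypothetical and the logic is inferred from the descriptions above, so treat it as one plausible reading, not the behaviour of scripts/config.py.

```python
import hashlib
import os

PYMUPDF4LLM_IMAGE_MIN_BYTES = 6000  # default stated above


def images_enabled(environ=os.environ, config_default=True):
    """KB_PDF_NO_IMAGES=1 disables extraction; otherwise the config default applies."""
    return False if environ.get("KB_PDF_NO_IMAGES") == "1" else config_default


def keep_images(images):
    """images: list of (name, bytes). Drop tiny files and exact duplicates."""
    seen, kept = set(), []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if len(data) < PYMUPDF4LLM_IMAGE_MIN_BYTES or digest in seen:
            continue
        seen.add(digest)
        kept.append(name)
    return kept
```

The page-area floor (PYMUPDF4LLM_IMAGE_SIZE_LIMIT) is left out here because it is applied during extraction, before this filter would run.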
When the user asks you to answer from / compare across their vault, read the
vault directly (grep + read .md). While doing so, self-monitor and surface
problems — don't just answer.
find "$KB_TARGET_DIR" -name "*.md" -not -path "*/.obsidian/*" | wc -l # file count
Set a rough scale and only mention it if there's a problem: small (<100 files) agentic grep is plenty · medium (100–500) watch keyword hit counts · large (500–2000) suggest a semantic-search layer (e.g. Smart Connections) · huge (>2000) recommend a real RAG layer.
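The file count and the rough scale could also be computed in Python, as in this sketch. The function names are hypothetical, and because the stated ranges overlap at 500 and 2000, placing exactly 500 in "medium" and exactly 2000 in "large" is an assumption.

```python
from pathlib import Path


def count_notes(vault: Path) -> int:
    """Count .md entries, ignoring anything under a .obsidian folder."""
    return sum(
        1
        for path in vault.rglob("*.md")
        if ".obsidian" not in path.relative_to(vault).parts
    )


def vault_scale(n_files: int) -> str:
    """Map a file count to the rough scale used for self-monitoring."""
    if n_files < 100:
        return "small"   # agentic grep is plenty
    if n_files <= 500:
        return "medium"  # watch keyword hit counts
    if n_files <= 2000:
        return "large"   # suggest a semantic-search layer
    return "huge"        # recommend a real RAG layer
```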
| Signal | Tell the user |
|---|---|
| one grep hits >30 files | keyword too broad — give a narrower one, or add semantic search |
| read 5+ files, still no answer | maybe a synonym gap, or it's genuinely not in the vault — list what you read |
| same topic asked repeatedly | offer to build an index/MOC for it |
| "which chapter covers X" needs full read-through | offer to enrich an outline for that doc |
| a doc is missing abstract | its enrich likely failed; offer to redo it |
| question needs exact numbers/formulas | remind them to click the source backlink and verify against the original |
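The first signal in the table (one grep hitting more than 30 files) is easy to check mechanically. The sketch below assumes a plain case-insensitive substring search; grep_files and breadth_warning are hypothetical helpers, and only the threshold of 30 comes from the table.

```python
from pathlib import Path

BROAD_HIT_LIMIT = 30  # from the ">30 files" row


def grep_files(vault: Path, keyword: str) -> list:
    """Case-insensitive substring search over .md files; returns matching paths."""
    needle = keyword.lower()
    hits = []
    for path in sorted(vault.rglob("*.md")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(path)
    return hits


def breadth_warning(hits: list, keyword: str):
    """Return a message for the user when the keyword is too broad, else None."""
    if len(hits) > BROAD_HIT_LIMIT:
        return (
            f"'{keyword}' matches {len(hits)} files; "
            "try a narrower keyword or add semantic search."
        )
    return None
```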
A MOC (Map of Content) is the user's entry note for a theme — frontmatter
type: moc, living in <vault>/索引/ (or index/).
## related files list, nothing more).

Frequency limits (avoid nagging): at most one MOC-evolution proposal per
session; skip if a proposal for this MOC was made less than 7 days ago; require a
real multi-signal pattern, not one offhand question; keep each proposal to a
single > 💡 … blockquote.
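The frequency limits amount to a small gate that must pass before any proposal is made. A minimal sketch follows, assuming dates are datetime.date values and reading "multi-signal" as at least two distinct signals; the function name, its parameters, and that threshold are assumptions.

```python
from datetime import timedelta

COOLDOWN = timedelta(days=7)
MIN_SIGNALS = 2  # assumption: "multi-signal" means at least two distinct signals


def may_propose(today, last_proposed, proposals_this_session, signals):
    """Return True only if every frequency limit allows a MOC proposal."""
    if proposals_this_session >= 1:
        return False  # at most one proposal per session
    if last_proposed is not None and today - last_proposed < COOLDOWN:
        return False  # this MOC was proposed less than 7 days ago
    return len(set(signals)) >= MIN_SIGNALS  # needs a real pattern
```

Counting distinct signals keeps a single question repeated several times from qualifying as a pattern on its own.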
.md files (user notes and tool output coexist).
claude -p for the optional enrich/OCR steps.
scripts/ here are a packaged snapshot.

npx claudepluginhub genli-ai/market-research-skills