From armory
Converts files and URLs to clean Markdown: PDF, DOCX, XLSX, PPTX, HTML, images (OCR), audio, CSV, YouTube using MarkItDown with trafilatura/Playwright fetches. For LLM/RAG pipelines.
npx claudepluginhub mathews-tom/armory --plugin armory
This skill uses the workspace's default tool permissions.
Convert any file or URL to clean Markdown using [MarkItDown](https://github.com/microsoft/markitdown) as the conversion engine, with a lightweight fetch layer for URLs.
| File | Purpose |
|---|---|
| references/formats.md | Per-format handling notes, internal engines, known gaps |
| references/fetch.md | URL fetch layer: trafilatura + Playwright strategies |
| references/install.md | Dependency install guide for all variants |
Determine the input type before touching any tool:
Input type?
  Local file path -> markitdown directly
  URL
    YouTube URL -> markitdown directly (transcript extraction built-in)
    Static page -> trafilatura fetch -> markitdown on HTML result
    JS-rendered / auth -> Playwright fetch -> markitdown on result
  Pasted HTML string -> markitdown directly on string
Do not use web_fetch or WebFetch for URLs — route through the fetch layer described in references/fetch.md to preserve the conversion pipeline.
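The routing decision above can be sketched as a small pure function. This is an illustrative sketch only: the branch names, the `YOUTUBE_HOSTS` set, and the `needs_browser` flag are assumptions made for the example, not part of the skill or of MarkItDown.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube. An assumption for this sketch, not an exhaustive list.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def route_input(source: str, needs_browser: bool = False) -> str:
    """Return the name of the pipeline branch for a given input.

    `needs_browser` is a hypothetical flag the caller sets when a page is
    known to be JS-rendered or behind a login.
    """
    text = source.strip()
    # Pasted HTML starts with a tag rather than a scheme or a path.
    if text.startswith("<"):
        return "markitdown-string"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        if (parsed.hostname or "").lower() in YOUTUBE_HOSTS:
            return "markitdown-direct"
        return "playwright-fetch" if needs_browser else "trafilatura-fetch"
    # Anything else is treated as a local file path.
    return "markitdown-direct"
```

Keeping the router free of I/O means the decision can be checked before any tool is invoked; the actual fetch and conversion calls would be dispatched on the returned branch name.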
uv pip show markitdown || uv pip install 'markitdown[all]' trafilatura
See references/install.md for selective installs and full dependency table.
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
result = md.convert("path/to/file.pdf")
print(result.text_content)
references/fetch.md).

| Context | Output behaviour |
|---|---|
| Single file, user wants file | Write <input_stem>.md to same directory |
| Single file, inline request | Return Markdown in conversation |
| Batch (multiple files) | Write each to <stem>.md, summarise what was produced |
| URL | Write <slug>.md to current directory or return inline |
| Piped into another workflow | Return result.text_content string only |
Default: "convert this file" -> write a file. "Read this" or "what does this say" -> return inline.
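As a rough illustration of the file-naming rows in the table, the helper below maps an input to its target `.md` path. The `slugify` rule (host plus path, lowercased, non-alphanumerics collapsed to hyphens) is an assumption for this sketch; the skill does not specify how `<slug>` is derived.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slugify(url: str) -> str:
    """Build a filesystem-safe slug from a URL's host and path (assumed rule)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".lower()
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return slug or "page"


def output_path(source: str) -> Path:
    """Map an input to the Markdown file it would be written to."""
    if urlparse(source).scheme in ("http", "https"):
        # URLs: <slug>.md in the current directory.
        return Path.cwd() / f"{slugify(source)}.md"
    # Local files: <input_stem>.md in the same directory as the input.
    return Path(source).with_suffix(".md")
```

For a batch, calling `output_path` on each input gives the list of files to report back in the summary.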
Source (two-column PDF with a table):
Annual Report 2024 Financial Highlights
Revenue grew 12% year-over-year... | Metric | 2023 | 2024 |
| Revenue | $4.2B | $4.7B |
| EBITDA | $1.1B | $1.3B |
Converted Markdown:
# Annual Report 2024
Revenue grew 12% year-over-year...
## Financial Highlights
| Metric | 2023 | 2024 |
| ------- | ----- | ----- |
| Revenue | $4.2B | $4.7B |
| EBITDA | $1.1B | $1.3B |
Multi-column layouts merge into linear flow. Tables are preserved as Markdown tables. Headings are inferred from font size/weight.
MarkItDown supports an llm_client for image description in PPTX and image files. Never enable it by default: it incurs cost, latency, and unexpected API calls. Prompt the user first: "This file contains images. Do you want me to use Claude to describe them? This will make additional API calls."
import anthropic
from markitdown import MarkItDown
client = anthropic.Anthropic()
md = MarkItDown(llm_client=client, llm_model="claude-sonnet-4-6")
result = md.convert("presentation.pptx")
Opus 4.7 vision ceiling: Opus 4.7 accepts images up to 2,576 pixels on the long edge (~3.75 MP), roughly 3× prior Claude models. When routing image-heavy documents through
llm_model="claude-opus-4-7", retain higher-resolution source images rather than pre-downsampling — text in screenshots and diagrams that previously required OCR may now be readable directly.
| Severity | Condition | Action |
|---|---|---|
| Terminal | Unsupported format (no converter exists) | Report to user immediately; do not retry |
| Terminal | Password-protected Office file | Report to user; no programmatic workaround |
| Terminal | File not found / path invalid | Report exact path; ask user to verify |
| Recover | Empty output from PDF | Likely scanned — escalate to OCR path in references/formats.md |
| Recover | Missing optional dependency (e.g. playwright) | Install the dependency, then retry the conversion |
| Recover | URL fetch returns paywall page | Report fetch limitation; do not retry or attempt bypass |
| Recover | trafilatura returns empty | Escalate to Playwright fetch strategy per references/fetch.md |
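The "trafilatura returns empty" recovery row amounts to an ordered fallback over fetch strategies. A minimal sketch, assuming each strategy is wrapped as a callable that takes a URL and returns HTML (or an empty value); `fetch_with_fallback` and the `Fetcher` alias are hypothetical names, and the real wrappers around trafilatura and Playwright are left to references/fetch.md.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a URL and returns HTML, or None / "" when nothing was extracted.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str, fetchers: Iterable[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each (name, fetcher) in order and return the first non-empty result."""
    for name, fetch in fetchers:
        html = fetch(url)
        if html and html.strip():
            return name, html
    # Every strategy came back empty: surface the failure instead of returning "".
    raise RuntimeError(f"All fetch strategies returned empty content for {url}")
```

Passing the strategies in as callables keeps the escalation order explicit and lets the cheap static fetch run before the heavier browser-based one.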
result = md.convert(path)
if not result.text_content.strip():
    raise ValueError(f"No text extracted from {path}. See references/formats.md for OCR options.")
Never silently return empty Markdown. Surface the failure with the severity and a pointer to the relevant reference file.
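One way to carry the severity and the reference pointer together is a small exception type. This is a sketch under assumptions: `ConversionFailure` and `check_output` are hypothetical names introduced here, not part of MarkItDown.

```python
from typing import Optional


class ConversionFailure(Exception):
    """Conversion error carrying a severity and a pointer for the user."""

    def __init__(self, severity: str, message: str, reference: Optional[str] = None):
        self.severity = severity      # "Terminal" or "Recover", as in the table
        self.reference = reference    # reference file to point the user at, if any
        hint = f" See {reference}." if reference else ""
        super().__init__(f"[{severity}] {message}{hint}")


def check_output(text: str, path: str) -> str:
    """Return the text unchanged, or raise if nothing was extracted."""
    if not text.strip():
        raise ConversionFailure(
            "Recover",
            f"No text extracted from {path}; the file is likely scanned.",
            reference="references/formats.md",
        )
    return text
```

A caller would pass `result.text_content` through `check_output` and branch on `severity` to decide whether to retry along a recovery path or report to the user straight away.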
html2text internally: complex layouts lose structure. For high-fidelity HTML conversion where DOM structure matters, suggest Turndown via Node subprocess.
references/formats.md.
include_tables=True to strip boilerplate.