OpenKB — Open LLM Knowledge Base
Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB
📑 What is OpenKB
OpenKB (Open Knowledge Base) is an open-source command-line tool that uses LLMs to compile raw documents into a structured, interlinked, wiki-style knowledge base. It is powered by PageIndex for vectorless long-document retrieval.
The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
Why not traditional RAG?
Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.
Features
- Broad format support — PDF, Word, Markdown, PowerPoint, HTML, Excel, text, and more via markitdown
- Scale to long documents — Long and complex documents are handled via PageIndex tree indexing, enabling accurate, vectorless long-context retrieval
- Native multi-modality — Retrieves and understands figures, tables, and images, not just text
- Compiled Wiki — An LLM compiles your documents into summaries, concept pages, and cross-links, and keeps them all in sync
- Query — Ask one-off questions against your wiki. The LLM navigates your compiled knowledge to answer
- Interactive Chat — Multi-turn conversations with persisted sessions you can resume across runs
- Lint — Health checks find contradictions, gaps, orphans, and stale content
- Watch mode — Drop files into raw/ and the wiki updates automatically
- Obsidian compatible — Wiki is plain .md files with [[wikilinks]]. Open in Obsidian for graph view and browsing
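Because the wiki is plain .md files connected by [[wikilinks]], simple checks can be run over it with ordinary tooling. The sketch below shows one way an orphan check could work: collect every link target, then report pages that nothing links to. It is an illustration only, not OpenKB's lint implementation; the helper names `load_pages` and `find_orphans` are hypothetical, and using the file stem as the page name is an assumption.

```python
import re
from pathlib import Path
from typing import Dict, Set

# Matches [[target]], [[target|alias]] and [[target#heading]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def load_pages(wiki_dir: str) -> Dict[str, str]:
    """Read every .md file under wiki_dir, keyed by file stem (assumed page name)."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in Path(wiki_dir).rglob("*.md")
    }


def find_orphans(pages: Dict[str, str]) -> Set[str]:
    """Return the names of pages that no other page links to."""
    linked = set()
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link does not rescue a page
                linked.add(target)
    return set(pages) - linked


# Hypothetical in-memory wiki, used here instead of reading from disk.
pages = {
    "index": "See [[rag]] and [[attention|Attention]].",
    "rag": "Related: [[attention#History]]",
    "attention": "",
    "stale-note": "No inbound links.",
}
orphans = find_orphans(pages)
```

An entry page such as index will usually show up as an orphan, since it is where navigation starts rather than a link target, so a real check would likely exclude it.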
🚀 Getting Started
Install
pip install openkb
Other install options
- Latest from GitHub:
  pip install git+https://github.com/VectifyAI/OpenKB.git
- Install from source (editable, for development):
  git clone https://github.com/VectifyAI/OpenKB.git
  cd OpenKB
  pip install -e .
Quick Start
# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb
# 2. Initialize the knowledge base
openkb init
# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/ # Add a whole directory
openkb add https://arxiv.org/pdf/2509.11420 # Or fetch from a URL
# 4. Ask a question
openkb query "What are the main findings?"
# 5. Or chat interactively
openkb chat
Set up your LLM
OpenKB comes with multi-LLM support (e.g., OpenAI, Claude, Gemini) via LiteLLM (pinned to a safe version).
Set your model during openkb init, or in .openkb/config.yaml, using LiteLLM's provider/model format (for example, anthropic/claude-sonnet-4-6). OpenAI models can omit the provider prefix (for example, gpt-5.4).
Create a .env file with your LLM API key:
LLM_API_KEY=your_llm_api_key
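To make the two conventions above concrete, here is a minimal sketch of how a provider/model string and a .env file could be interpreted. It is not how OpenKB or LiteLLM resolve these internally; `split_model` and `parse_env` are hypothetical helpers, and treating a missing prefix as "openai" simply mirrors the rule stated above.

```python
from typing import Dict, Tuple


def split_model(model: str, default_provider: str = "openai") -> Tuple[str, str]:
    """Split 'provider/model' on the first slash; no slash means the default provider."""
    provider, sep, name = model.partition("/")
    if not sep:
        return default_provider, model
    return provider, name


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


claude = split_model("anthropic/claude-sonnet-4-6")
openai_default = split_model("gpt-5.4")
env = parse_env("# LLM credentials\nLLM_API_KEY=your_llm_api_key\n")
```

Splitting on the first slash only keeps any further slashes inside the model name, which matters for providers whose model identifiers contain a path.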
🧩 How OpenKB Works
Architecture
raw/                              You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                        Wiki Compilation (using LLM)
 │                                     │
 ▼                                     ▼
wiki/
 ├── index.md            Knowledge base overview
 ├── log.md              Operations timeline
 ├── AGENTS.md           Wiki schema (LLM instructions)
 ├── sources/            Full-text conversions
 ├── summaries/          Per-document summaries
 ├── concepts/           Cross-document synthesis ← the good stuff
 ├── explorations/       Saved query results
 └── reports/            Lint reports
Short vs. Long Document Handling
| | Short documents | Long documents (PDF ≥ 20 pages) |
|---|---|---|
| Convert | markitdown → Markdown | PageIndex → tree index + summaries |
| Images | Extracted inline (pymupdf) | Extracted by PageIndex |
| LLM reads | Full text | Document trees |
| Result | summary + concepts | summary + concepts |
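The routing rule in the table can be sketched as a small decision function: PDFs of 20 pages or more go to PageIndex, and everything else goes to markitdown. This is an illustration under stated assumptions, not OpenKB's code. The function name `choose_converter` and the returned labels are hypothetical, and the page count is passed in by the caller rather than read from the file.

```python
from pathlib import Path
from typing import Optional

LONG_PDF_MIN_PAGES = 20  # threshold taken from the table above


def choose_converter(path: str, page_count: Optional[int] = None) -> str:
    """Return 'pageindex' for long PDFs and 'markitdown' for everything else."""
    is_pdf = Path(path).suffix.lower() == ".pdf"
    if is_pdf and page_count is not None and page_count >= LONG_PDF_MIN_PAGES:
        return "pageindex"
    # Short PDFs and all non-PDF formats take the full-text route.
    return "markitdown"


long_pdf = choose_converter("paper.pdf", page_count=20)
short_pdf = choose_converter("paper.pdf", page_count=19)
markdown_note = choose_converter("notes.md")
long_deck = choose_converter("deck.pptx", page_count=80)
```

Both routes end in the same place, a summary plus concept pages, so the choice only affects what the LLM reads: full text for short documents, a document tree for long ones.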