rag_for_git
🇷🇺 Russian version: README.ru.md
An agent that automatically reviews pull/merge requests using RAG + a code graph + Claude Code.
What it is
Plain linters catch syntax and style but miss meaning and relationships: a broken
function contract, the impact of a change on its callers, a removed guard, a contradiction
with an existing test. This agent gives an LLM the same context a human reviewer has:
semantic + lexical retrieval over the whole repository, structural code-graph expansion, and
an agentic tool loop. It then posts the result back to GitHub as inline comments on diff lines
plus a summary.
A single PR review runs as three stages:
prepare_review (MCP) → analyze (Claude subagents) → publish_review (MCP)
- prepare: GitHubProvider pulls the PR (base/head SHA) and its changed files; changed .py
files are chunked (tree-sitter) and embedded (Voyage) into an ephemeral overlay ref="pr:N";
policy and per-file review units are assembled.
- analyze: the Claude Code skill fans out one subagent per file. Each reasons over the diff
in a tool loop, pulling in whatever code it needs via search_code, get_related_symbols,
read_file, get_definition, find_callers, and get_changed_file_diff.
- publish: a deterministic tail that runs policy gate (category/severity/confidence/paths) →
line grounding by exact code quote (anti-hallucination) → dedup → assemble (inline vs summary,
suggestion invariants, fingerprint idempotency, comment cap) → post to GitHub → history record
→ overlay/session cleanup.
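The line-grounding step in publish can be illustrated with a small sketch. This is not the project's implementation: the function name ground_finding, the shape of its inputs, and the "exactly one match" rule are all assumptions made for illustration. The idea it shows is that a finding is pinned to a diff line only when the code it quotes appears verbatim on the new side of the diff.

```python
from typing import Dict, Optional


def ground_finding(quote: str, new_lines: Dict[int, str]) -> Optional[int]:
    """Return the diff line whose text contains `quote`, or None.

    `new_lines` maps line numbers on the new side of the diff to their text.
    Hypothetical helper: names and matching rules are assumptions.
    """
    # Normalize whitespace so indentation differences do not block a match.
    needle = " ".join(quote.split())
    if not needle:
        return None
    matches = [
        lineno
        for lineno, text in sorted(new_lines.items())
        if needle in " ".join(text.split())
    ]
    # Accept only an unambiguous match: a quote found on zero or several
    # lines cannot be pinned to a single diff line.
    return matches[0] if len(matches) == 1 else None


diff_lines = {
    10: "def load(path):",
    11: "    data = open(path).read()",
    12: "    return json.loads(data)",
}
grounded = ground_finding("data = open(path).read()", diff_lines)
ungrounded = ground_finding("data = read_file(path)", diff_lines)
```

Here the first quote resolves to line 11, while the second, which does not occur in the diff, resolves to None. What happens to an ungrounded finding is left to the caller in this sketch.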
Status: working v1. Target analysis language is Python; VCS is GitHub (behind a
VCSProvider interface). Proven live: it catches real bugs and sees the impact on calling code
and existing tests.
How it works / Architecture
The core is the reviewer/ library, assembled in reviewer/app.py::build_components(settings)
from Settings (pydantic-settings, .env). Entry points are reviewer/entrypoints/cli.py (Click)
and reviewer/entrypoints/mcp_server.py (FastMCP). Three pieces work together:
- RAG (hybrid retrieval). Postgres/ParadeDB stores code chunks with pgvector (HNSW ANN) and
pg_search (BM25). A query is embedded with Voyage and run through both ANN and BM25 search;
the two result lists are merged with Reciprocal Rank Fusion (RRF), then reranked with Voyage
rerank-2.5.
- Code graph (SCIP or tree-sitter, Neo4j). Symbols and their relationships live in Neo4j.
The graph orchestrator (graph/backend.py) picks a backend via GRAPH_BACKEND
(auto|scip|treesitter): SCIP (@sourcegraph/scip-python) gives a precise, type-aware graph
with CALLS + IMPLEMENTS edges; tree-sitter is a fast fallback with CALLS-by-name only.
Retrieval expands the changed symbols by 1 to 2 hops to surface callers, callees,
implementations, and tests.
- Claude Code plugin via MCP. The reviewer-mcp server exposes prepare_review,
publish_review, and the agent tools. The Claude Code plugin (plugin/) drives the review: it
calls prepare_review, runs analysis subagents against those MCP tools, then calls
publish_review.
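The fusion step in the RAG bullet above can be sketched in a few lines. This is a generic Reciprocal Rank Fusion example, not code from reviewer/: the function name rrf_merge, the sample node IDs, and the constant k=60 (a value commonly used with RRF) are assumptions, and the project's actual constant may differ. Each document scores 1/(k + rank) in every list it appears in, and the scores are summed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rrf_merge(rankings: Iterable[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of IDs with Reciprocal Rank Fusion.

    Hypothetical helper; k=60 is an assumed default, not the project's value.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))


# Illustrative result lists from the two retrievers (made-up IDs).
ann_hits = ["a.py#f", "b.py#g", "c.py#h"]
bm25_hits = ["b.py#g", "d.py#k", "a.py#f"]
merged = rrf_merge([ann_hits, bm25_hits])
```

IDs that appear in both lists rise to the top even when neither retriever ranked them first, which is why b.py#g ends up ahead of a.py#f here. A reranker would then reorder this fused list.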
The single key linking RAG and the graph is node_id = "path#fqn" (e.g.
rag/embedder.py#VoyageEmbedder.embed_query). Both the chunk in Postgres and the node in Neo4j use
it, so graph expansion and chunk retrieval are stitched together without any mapping table.
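A minimal sketch of how a shared key of this shape can stitch the two stores together. The helpers make_node_id and parse_node_id, the NodeId tuple, and the in-memory dictionaries standing in for Postgres and Neo4j are all hypothetical. The sketch also assumes that a path never contains "#", which the source does not state.

```python
from typing import Dict, List, NamedTuple


class NodeId(NamedTuple):
    path: str
    fqn: str


def make_node_id(path: str, fqn: str) -> str:
    """Build the shared key in the "path#fqn" form."""
    return f"{path}#{fqn}"


def parse_node_id(node_id: str) -> NodeId:
    """Split "path#fqn" at the first '#' (assumes paths contain no '#')."""
    path, sep, fqn = node_id.partition("#")
    if not sep or not path or not fqn:
        raise ValueError(f"not a path#fqn key: {node_id!r}")
    return NodeId(path, fqn)


# Stand-ins for the chunk store and for graph expansion results.
key = make_node_id("rag/embedder.py", "VoyageEmbedder.embed_query")
chunks_by_id: Dict[str, str] = {key: "def embed_query(self, text): ..."}
graph_neighbors: List[str] = [key, "rag/missing.py#gone"]

# Because both sides use the same key, joining is a plain dictionary lookup.
stitched = {nid: chunks_by_id[nid] for nid in graph_neighbors if nid in chunks_by_id}
parsed = parse_node_id(key)
```

The lookup needs no translation layer: a neighbor returned by the graph either has a chunk under the same key or it does not.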
Index freshness: a stable base + a PR overlay. A full reindex of a large repo is expensive, so
the index keeps a persistent base and layers PR changes on top:
- ref="base:<branch>": the persistent index of a tracked branch (e.g. "base:main",
"base:master"). Each tracked branch in REVIEW_BRANCHES has its own isolated index. It is
updated incrementally by reviewer index --ref <branch>: only changed files are chunked, and
only chunks with a new content_hash are re-embedded. Embeddings are reused across branches by
hash, which saves Voyage quota.
- ref="pr:N": an ephemeral overlay of just the PR's changed files at its HEAD.
- On a query: retrieval = (base:<branch> where path ∉ changed) ∪ overlay. For changed files
the agent sees the new version; for everything else, the stable base.
- Multi-branch. A PR is reviewed against the index of its target branch (base_ref from the
PR). A PR targeting an untracked branch is skipped (prepare_review returns
{"status":"skipped",...}). The code graph (Neo4j :Symbol) is also branch-scoped via a
branch property, with unique constraint (repo, branch, id).
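The base-plus-overlay rule for a query can be sketched with plain Python collections. This is an illustration under assumptions, not the project's query code: the Chunk dataclass, the visible_chunks function, and the sample ref "pr:7" are hypothetical, and the real system presumably applies the same filter inside its database queries.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Chunk:
    ref: str      # "base:<branch>" or "pr:N"
    path: str
    node_id: str  # "path#fqn"


def visible_chunks(
    base: Iterable[Chunk], overlay: Iterable[Chunk], changed_paths: Set[str]
) -> List[Chunk]:
    """Return (base where path not in changed) union overlay.

    Hypothetical helper showing the set rule only.
    """
    # Base chunks of changed files are stale for this PR, so they are hidden.
    kept_base = [chunk for chunk in base if chunk.path not in changed_paths]
    # The overlay supplies the PR's version of every changed file.
    return kept_base + list(overlay)


base_index = [
    Chunk("base:main", "a.py", "a.py#f"),
    Chunk("base:main", "b.py", "b.py#g"),
]
pr_overlay = [Chunk("pr:7", "a.py", "a.py#f")]  # illustrative PR number
visible = visible_chunks(base_index, pr_overlay, changed_paths={"a.py"})
```

For the changed file a.py only the overlay chunk is visible, while b.py still comes from the base index. A file deleted by the PR would appear in changed_paths with no overlay chunks, so it would drop out of the results entirely.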