AST-enhanced extractor for data models, entities, and relationships with 80%+ token reduction
Install:

```bash
npx claudepluginhub joshuarweaver/cascade-code-general-misc-3 --plugin jingnanzhou-fellowsonnet
```
Analyze the target codebase and extract **data/object models** to understand WHAT exists in this project.

**NEW**: Uses AST (Abstract Syntax Tree) extraction for 80-90% token reduction during structural analysis, then Claude semantic analysis for domain understanding.

---

Use the `ast_extractor.py` tool to extract structural information:
```bash
# Extract entity signatures from target project
python3 ${CLAUDE_PLUGIN_ROOT}/tools/ast_extractor.py ${TARGET_PROJECT} > /tmp/entity_structures.txt
```
**What This Provides:** Class and function signatures, docstrings, attributes, and method signatures, each with a precise source location.

**Example Output:**
```
## File: src/models/user.py

class User(BaseModel):
  Location: src/models/user.py:10
  Doc: Represents a user account in the system
  Attributes:
    - id: int
    - email: str
    - created_at: datetime
  Methods:
    - validate_email(email: str) -> bool
      Doc: Validates email format
    - get_by_id(user_id: int) -> Optional[User]
      Doc: Retrieves user by ID
```
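For orientation, here is a minimal sketch of how this kind of signature extraction can work with Python's built-in `ast` module. It is illustrative only; the actual `ast_extractor.py` shipped with the plugin may structure its traversal and output differently.

```python
# Illustrative sketch of AST-based signature extraction (not the plugin's
# actual ast_extractor.py implementation).
import ast
import sys

def summarize(path: str) -> None:
    """Print class and method signatures without reading full bodies."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(b) for b in node.bases)
            print(f"class {node.name}({bases}):  # line {node.lineno}")
            doc = ast.get_docstring(node)
            if doc:
                print(f"  Doc: {doc.splitlines()[0]}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    ret = f" -> {ast.unparse(item.returns)}" if item.returns else ""
                    print(f"  - {item.name}({args}){ret}")

if __name__ == "__main__":
    summarize(sys.argv[1])
```

Because only signatures and docstrings are emitted, the agent never needs to read method bodies during this phase, which is where the token reduction comes from.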
Read the AST structure and extract semantic meaning:
**For Each Entity, Extract:**

- **Purpose**: What does this entity represent in the domain?
- **Domain Meaning**: How does it fit into the business domain?
- **Relationships**: How it connects to other entities (has-many, belongs-to, references)
- **Constraints & Invariants**: Rules that must always hold for the entity
- **Design Patterns Used**: e.g. Entity, Value Object, Aggregate Root, Repository
**Input (from AST):**

```
class Order(BaseModel):
  Location: src/models/order.py:15
  Methods:
    - calculate_total() -> Decimal
    - add_item(product: Product, quantity: int) -> None
    - can_ship() -> bool
```
**Output (Semantic JSON):**

```json
{
  "name": "Order",
  "type": "class",
  "purpose": "Represents a customer order with items and total calculation",
  "domain_meaning": "Core entity in e-commerce domain, manages order lifecycle",
  "attributes": [
    {
      "name": "items",
      "type": "List[OrderItem]",
      "purpose": "Collection of items in the order",
      "constraints": ["must have at least 1 item to ship"]
    },
    {
      "name": "total",
      "type": "Decimal",
      "purpose": "Calculated total price of all items",
      "constraints": ["must be >= 0", "calculated from items"]
    }
  ],
  "relationships": [
    {
      "type": "has-many",
      "target": "OrderItem",
      "description": "Order contains multiple items"
    },
    {
      "type": "references",
      "target": "Product",
      "description": "Each item references a Product"
    }
  ],
  "invariants": [
    "Order must have at least 1 item",
    "Total must equal sum of item prices",
    "Cannot ship if total is 0"
  ],
  "patterns": ["Entity", "Aggregate Root"],
  "grounding": {
    "file": "src/models/order.py",
    "line_start": 15,
    "line_end": 45
  }
}
```
**IMPORTANT**: Use the shared filtering utilities to skip non-production code.

```bash
# Check if files should be analyzed
python3 ${CLAUDE_PLUGIN_ROOT}/tools/should_analyze.py src/app.py node_modules/lib.js
```

Or import in Python:

```python
from file_filters import should_exclude_path

if should_exclude_path("node_modules/foo/bar.js"):
    # Skip this file
    pass
```
- **Excluded directories**: `dist`, `build`, `node_modules`, `venv`, `.git`, etc.
- **Excluded test files**: `test`, `spec`, `__tests__`, etc.
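If the shared utility is unavailable, an exclusion check along these lines captures the same intent. The function name `looks_excluded` and the pattern sets are illustrative assumptions; the real logic lives in the plugin's `file_filters` module and may differ.

```python
# Hypothetical stand-in for file_filters.should_exclude_path; the plugin's
# real implementation may use different patterns and matching rules.
from pathlib import PurePath

EXCLUDED_DIRS = {"dist", "build", "node_modules", "venv", ".git"}
TEST_MARKERS = ("test", "spec", "__tests__")

def looks_excluded(path: str) -> bool:
    parts = PurePath(path).parts
    if any(part in EXCLUDED_DIRS for part in parts):
        return True
    name = PurePath(path).name.lower()
    return any(marker in name for marker in TEST_MARKERS)

print(looks_excluded("node_modules/foo/bar.js"))  # True
print(looks_excluded("src/app.py"))               # False
```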
```bash
# Create todo list
# - Extract entity structures via AST
# - Analyze semantic meaning with Claude
# - Build entity relationships
# - Identify constraints and invariants
# - Generate final JSON output

# Run AST extraction
python3 ${CLAUDE_PLUGIN_ROOT}/tools/ast_extractor.py ${TARGET_PROJECT} > /tmp/structures.txt

# Read the structured output (much smaller than full files!)
cat /tmp/structures.txt
```
Now analyze the structured output to extract:

- **Entity Purpose**: For each class/entity, determine its purpose and domain meaning
- **Relationships**: Infer from:
  - Method parameters (`add_user(user: User)` → uses User)
  - Return types (`get_orders() -> List[Order]` → has Orders)
- **Constraints**: Infer from:
  - Method names (`validate_*` implies validation rules)
  - Attribute patterns (`created_at` is likely required, `updated_at` optional)
- **Patterns**: Identify from:
  - Base classes (`BaseModel`, `Entity`, `ValueObject`)
  - Factory methods (`create()` factory method)

Analyze cross-entity relationships:
```
# Example: Finding relationships
# If Order has method: add_item(product: Product, quantity: int)
#   → Order has-many OrderItem (inferred)
#   → OrderItem references Product
```
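The same inference can be sketched mechanically: parameter and return annotations suggest candidate relationships, which the agent then confirms semantically. The sample source and the builtin-name filter below are illustrative assumptions, not part of the plugin.

```python
# Heuristic sketch: derive candidate relationships from method signatures.
# The agent performs the real inference semantically; this only illustrates it.
import ast

SAMPLE = '''
class Order:
    def add_item(self, product: Product, quantity: int) -> None: ...
    def get_user(self) -> User: ...
'''

BUILTINS = {"int", "str", "float", "bool", "None"}

tree = ast.parse(SAMPLE)
for cls in ast.walk(tree):
    if not isinstance(cls, ast.ClassDef):
        continue
    for fn in (n for n in cls.body if isinstance(n, ast.FunctionDef)):
        # Parameter annotations suggest "uses"/"references" relationships
        for arg in fn.args.args[1:]:  # skip self
            if arg.annotation is not None:
                target = ast.unparse(arg.annotation)
                if target not in BUILTINS:
                    print(f"{cls.name} -> uses {target}")
        # Return annotations suggest "has"/"returns" relationships
        if fn.returns is not None:
            target = ast.unparse(fn.returns)
            if target not in BUILTINS:
                print(f"{cls.name} -> returns {target}")
```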
Save to an incremental JSON file using the `save_json.py` tool:
```bash
# CRITICAL: Use incremental saving to handle large knowledge bases
python3 ${CLAUDE_PLUGIN_ROOT}/tools/save_json.py \
  --output "${TARGET_PROJECT}/.fellow-data/semantic/factual_knowledge.json" \
  --mode start \
  --type factual

# Add each entity incrementally
python3 ${CLAUDE_PLUGIN_ROOT}/tools/save_json.py \
  --output "${TARGET_PROJECT}/.fellow-data/semantic/factual_knowledge.json" \
  --mode add \
  --section entities \
  --data '{"name": "User", "type": "class", ...}'

# Finalize when done
python3 ${CLAUDE_PLUGIN_ROOT}/tools/save_json.py \
  --output "${TARGET_PROJECT}/.fellow-data/semantic/factual_knowledge.json" \
  --mode finalize
```
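When many entities are involved, the same start/add/finalize flow can be driven from a small Python loop instead of separate shell commands. This is a convenience sketch using only the flags shown above; the paths and the example entity list are placeholder assumptions.

```python
# Sketch: drive save_json.py (start / add / finalize) from Python.
# Flags mirror the shell example above; entity data here is placeholder.
import json
import os
import subprocess

TOOL = os.path.expandvars("${CLAUDE_PLUGIN_ROOT}/tools/save_json.py")
OUTPUT = os.path.expandvars(
    "${TARGET_PROJECT}/.fellow-data/semantic/factual_knowledge.json"
)

def save_json(*args: str) -> None:
    subprocess.run(["python3", TOOL, "--output", OUTPUT, *args], check=True)

entities = [
    {"name": "User", "type": "class"},
    {"name": "Order", "type": "class"},
]

save_json("--mode", "start", "--type", "factual")
for entity in entities:
    save_json("--mode", "add", "--section", "entities", "--data", json.dumps(entity))
save_json("--mode", "finalize")
```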
The finalized `factual_knowledge.json` has this shape:

```json
{
  "metadata": {
    "extraction_version": "2.0-ast",
    "timestamp": "2026-01-13T10:30:00Z",
    "target_project": "/path/to/project",
    "extractor": "factual-knowledge-extractor-v2-ast",
    "method": "ast_extraction + semantic_analysis"
  },
  "entities": [
    {
      "name": "User",
      "type": "class",
      "purpose": "Represents a user account...",
      "domain_meaning": "Core entity in authentication...",
      "attributes": [...],
      "methods": [...],
      "relationships": [...],
      "invariants": [...],
      "patterns": [...],
      "grounding": {
        "file": "...",
        "line_start": 10,
        "line_end": 50
      }
    }
  ],
  "entity_relationships": [
    {
      "source": "Order",
      "target": "User",
      "type": "belongs-to",
      "description": "Each order belongs to a user",
      "multiplicity": "many-to-one"
    }
  ],
  "summary": {
    "total_entities": 25,
    "total_relationships": 42,
    "patterns_used": ["Entity", "Value Object", "Repository"],
    "key_entities": ["User", "Order", "Product"]
  }
}
```
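A quick way to sanity-check the generated file against this shape is a short script like the one below. It is a convenience check, not the plugin's own schema validation, and the path assumes the default output location.

```python
# Convenience check of factual_knowledge.json against the shape shown above.
import json

PATH = ".fellow-data/semantic/factual_knowledge.json"  # assumed default location

with open(PATH, encoding="utf-8") as f:
    kb = json.load(f)

for key in ("metadata", "entities", "entity_relationships", "summary"):
    assert key in kb, f"missing top-level key: {key}"

for entity in kb["entities"]:
    assert "grounding" in entity, f"{entity.get('name')} lacks source grounding"

print(f"{len(kb['entities'])} entities, {len(kb['entity_relationships'])} relationships")
```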
- **Python Stdlib Only**: Uses the built-in `ast` module (no dependencies)
- **Language Support**: Python (via the built-in `ast` module)
- **Incremental Updates**: AST extraction works with Fellow's incremental update system
- **Error Handling**: If AST parsing fails (syntax errors), fall back to Claude reading the file; see the sketch after this list
- **Grounding**: All entities include precise source locations from the AST
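The fallback described above amounts to a small gate in front of the extractor: attempt to parse, and hand the file to Claude for a full read if parsing fails. This is a sketch of that behaviour, not the tool's actual code.

```python
# Sketch of the AST-or-fallback decision, not the plugin's actual code.
import ast

def extraction_mode(path: str) -> str:
    """Return "ast" if structural extraction is possible, else "full-read"."""
    try:
        with open(path, encoding="utf-8") as f:
            ast.parse(f.read(), filename=path)
        return "ast"        # structural extraction via AST
    except SyntaxError:
        return "full-read"  # fall back to Claude reading the whole file
```

The table below compares V1 (traditional, full-file reading) with V2 (AST-enhanced):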
| Aspect | V1 (Traditional) | V2 (AST-Enhanced) |
|---|---|---|
| Token Usage | ~17,000 tokens | ~2,200 tokens |
| Reduction | Baseline | 87% fewer |
| Speed | Baseline | 7-8x faster |
| Accuracy | Good (Claude reads code) | Excellent (AST + Claude) |
| Structure | Inferred by Claude | 100% accurate from AST |
| Semantics | Good | Same (Claude analysis) |
| Dependencies | None | None (stdlib only) |
```bash
# Extract factual knowledge with AST enhancement
cd ${TARGET_PROJECT}

# Step 1: AST extraction (fast, token-efficient)
python3 ${CLAUDE_PLUGIN_ROOT}/tools/ast_extractor.py . > /tmp/entities.txt

# Step 2: Claude semantic analysis
# [Agent reads /tmp/entities.txt and extracts domain meaning]

# Step 3: Save to JSON
# [Agent uses save_json.py for incremental output]

# Result: factual_knowledge.json with complete entity information
```
- ✅ All classes, functions, and data structures identified
- ✅ Entity purposes and domain meanings documented
- ✅ Relationships accurately mapped
- ✅ Constraints and invariants extracted
- ✅ Design patterns identified
- ✅ All entities grounded to source locations
- ✅ 80%+ token reduction achieved
- ✅ JSON output validates against schema
After factual extraction completes: