Skill

extract

General extraction/ingestion skill that routes to specific workflows based on input type. Extracts structured information from documents, emails, reviews, feedback, and other sources.

From aops-cowork

Install

Run in your terminal

npx claudepluginhub nicsuzor/academicops --plugin aops-cowork

Tool Access

This skill is limited to using the following tools:

ReadGrepGlobEditWriteSkillBash

Supporting Assets

View in Repository

procedures/review-inline-comments.md

Skill Content

Similar Skills

agent-harness-construction

Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.

ecc

140.3k

agent-payment-x402

Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.

ecc

140.3k

agent-eval

Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.

ecc

140.3k

Stats

Parent Repo Stars0

Parent Repo Forks1

Last CommitApr 2, 2026

Actions

View Source View Plugin View on GitHub View README

Extraction & Ingestion Skill

Taxonomy note: This skill provides domain expertise (HOW) for extracting structured information from documents and sources. See [[TAXONOMY.md]] for the skill/workflow distinction.

General-purpose extraction skill that intelligently routes to specialized workflows based on input type. Extracts structured information from various sources and stores it appropriately (public framework vs. private data).

Framework Context

[[AXIOMS.md]] — including [[AXIOMS.md#P52]] (Read-Then-Write Memory)

Search Before Creating

MANDATORY before creating any new extracted content: search PKB for existing knowledge on the same subject.

mcp__pkb__search(query="[topic/person/document subject]")

If a match exists → augment it rather than creating a duplicate
If people are mentioned → retrieve existing relationship context to enrich extraction
If no match → proceed with creation

This prevents duplicate memories and grounds extraction in accumulated knowledge. See [[remember]] skill's "search first" step as the model.

Purpose

Provide a unified entry point for all extraction tasks:

Training data extraction from feedback documents
Archive information extraction (emails, correspondence)
Decision extraction from task queues
Knowledge extraction from documents
Review pair extraction for LLM training

Workflow Routing

When invoked, analyze the input and route to the appropriate workflow:

1. Training Data Extraction

Signals:

Input contains review feedback + source document
User mentions "training", "extract patterns", "learning", "dataset"
Documents contain tracked changes, comments, or annotations
Goal is to build LLM training data

Route to: workflows/training-data.md

Storage:

Sensitive data (actual review content, source documents): $ACA_DATA/processed/review_training/
Generalized patterns (depersonalized principles): Framework docs or peer-review skill

2. Archive Information Extraction

Signals:

Input is email archive, correspondence, receipts
User mentions "archive", "preserve", "remember"
Goal is to capture significant events/relationships
Source is historical documents

Route to: Existing archived/skills/extractor/SKILL.md logic

Storage: Use Skill(skill="remember") for PKB storage

3. Decision Extraction

Signals:

User mentions "decisions", "pending", "blocking"
Goal is to surface approval/choice items
Source is task queue

Route to: Existing aops-core/skills/decision-extract/SKILL.md

Storage: Daily note with decision formatting

4. Document Knowledge Extraction

Signals:

Single document needs key information extracted
User mentions "extract", "parse", "ingest"
Not training data, not archive, not decisions
Goal is structured information retrieval

Route to: workflows/document-knowledge.md (to be created)

Storage: Depends on content - PKB or framework docs

Workflow: Training Data Extraction

Input Types

Type A: Review with Inline Comments (DOCX with tracked changes)

Example: Peer review with inline comments and suggestions
Procedure: procedures/review-inline-comments.md
Output: Training pairs + generalized principles

Type B: Separate Review + Source Documents

Example: review.txt + source.pdf + metadata.json
Workflow: archived/skills/review-training/SKILL.md
Output: Training pairs matching feedback to source evidence

Type C: Revision History

Example: Git history, Google Docs revision history, track changes
Workflow: workflows/revision-history.md (to be created)
Output: Before/after pairs with change rationales

Extraction Process

See procedures/review-inline-comments.md for detailed procedure.

Quick summary:

Convert to workable format (preserve markup)
Extract feedback units (text + comment pairs)
Categorize feedback (type, scope, action)
Identify patterns (group similar feedback)
Generalize principles (abstract to transferable form)
Separate sensitive/public (raw data vs. patterns)
Store appropriately (sensitive → $ACA_DATA, patterns → framework)

Storage Rules

CRITICAL: Training data often contains sensitive information (author names, unpublished work, specific critiques).

Sensitive data → $ACA_DATA/processed/review_training/{collection_name}/:

extracted_examples.json (full text/feedback pairs)
training_pairs.jsonl (machine-readable format)
collection_summary.md (with identifying information)
Source documents (if retained)

Generalized patterns → Framework (public repo):

aops-core/skills/hydrator/workflows/peer-review.md (update with principles)
aops-core/skills/*/references/ (depersonalized examples)
No names, no specific unpublished content, no identifying details

Quality Standards

High-quality extraction:

Clear connection between feedback and source
Sufficient context for learning
Well-categorized with teaching points
Patterns are transferable

Generalization quality:

Principles are specific enough to apply
Principles are general enough to transfer
Examples span different contexts
Limitations are documented

Workflow: Archive Information Extraction

Delegate to archived/skills/extractor/SKILL.md.

Key principle: Most archival documents have NO long-term value. Be highly selective.

Extract: Concrete outcomes, significant relationships, financial records Skip: Newsletters, invitations, administrative routine, mass communications

Storage: Use Skill(skill="remember") with proper tags and canonical identifiers.

Workflow: Decision Extraction

Delegate to aops-core/skills/decision-extract/SKILL.md.

Key principle: Extract tasks requiring approval/choice that are blocking other work.

Storage: Daily note with formatted decision list for batch processing.

Sensitive Data Handling

What is Sensitive?

Author names and identifying information
Unpublished work content
Specific critiques of individuals' work
Email content and correspondence
Personal information
Institutional confidential information

Storage Location: `$ACA_DATA/processed/`

Directory structure:

$ACA_DATA/processed/
├── review_training/
│   ├── {collection_name}/
│   │   ├── extracted_examples.json
│   │   ├── training_pairs.jsonl
│   │   ├── collection_summary.md
│   │   └── source_documents/
│   └── ...
├── email_archive/
│   └── ...
└── ...

Access: This directory is:

Outside the public academicOps repo
In personal data directory ($ACA_DATA = /home/nic/brain)
Should be backed up separately
Not committed to git

Depersonalization for Public Framework

When adding examples to public framework docs:

Remove all names (use "Author", "Reviewer", or generic placeholders)
Remove specific work titles (use generic descriptions)
Remove institutional affiliations
Generalize to principle, not specific instance
Use constructed examples if real ones can't be depersonalized

Integration with Other Skills

When to use `/extract` vs. specialized skills

Use /extract:

Unclear what type of extraction is needed
Multiple types of documents to process
Want intelligent routing to appropriate workflow

Use specialized skill directly:

/remember - When you know you want to add to knowledge base
/decision-extract - When specifically extracting decisions
/review-training - When processing matched review/source pairs (legacy)

Skill Composition

/extract → analyze input → route to:
  - /remember (for archival preservation)
  - /decision-extract (for pending decisions)
  - training-data workflow (for LLM training data)
  - document-knowledge workflow (for general extraction)

Error Handling

Scenario	Behavior
Unclear input type	Ask user to clarify extraction goal
Cannot convert document format	Try alternative conversion, document failure
Ambiguous feedback	Flag with `"quality": "ambiguous"`, include with caveats
No clear extraction value	Ask user if they want to skip or force extraction
Storage location unclear	Default to `$ACA_DATA/processed/`, confirm with user

Examples

Example 1: Peer Review Extraction

Input: DOCX file with inline comments from peer review

User: /extract /path/to/review.docx --type peer-review

Agent:

Detects inline comments → routes to review-inline-comments workflow
Converts DOCX with pandoc --track-changes=all
Extracts 18 feedback units
Identifies 10 generalisable principles
Stores sensitive data in $ACA_DATA/processed/review_training/aoir2026/
Updates aops-core/skills/hydrator/workflows/peer-review.md with depersonalized principles

Example 2: Email Archive

Input: Directory of email MSG files

User: /extract emails/2025-Q1/ --type archive

Agent:

Detects email archive → routes to extractor skill logic
Processes each email, applies judgment criteria
Extracts significant events/relationships
Uses /remember to store in PKB
Skips 90% of emails as noise

Example 3: Auto-Detection

Input: Mixed documents without type specified

User: /extract documents/

Agent:

Analyzes each document
Detects: 2 peer reviews (tracked changes), 5 emails, 1 grant application
Routes peer reviews → training-data workflow
Routes emails → archive extraction
Asks user about grant application (unclear extraction goal)

Validation Checklist

Before completing extraction:

Completeness:

All extractable items identified
All items processed (or skipped with reason)
Output files created in correct locations

Quality:

Teaching value is clear (for training data)
Categorization is accurate
Context is sufficient

Sensitivity:

Sensitive data stored in $ACA_DATA/processed/
Public framework contains only depersonalized content
No identifying information in public docs

Documentation:

Extraction process documented
Decisions and ambiguities noted
Collection summary created

Future Enhancements

Semi-automated pattern detection
Batch processing for multiple documents
Integration with continuous ingestion pipeline
Quality metrics and validation tools
Cross-collection pattern analysis

extract

extract

Extraction & Ingestion Skill

Framework Context

Search Before Creating

Purpose

Workflow Routing

1. Training Data Extraction

2. Archive Information Extraction

3. Decision Extraction

4. Document Knowledge Extraction

Workflow: Training Data Extraction

Input Types

Type A: Review with Inline Comments (DOCX with tracked changes)

Type B: Separate Review + Source Documents

Type C: Revision History

Extraction Process

Storage Rules

Quality Standards

Workflow: Archive Information Extraction

Workflow: Decision Extraction

Sensitive Data Handling

What is Sensitive?

Storage Location: $ACA_DATA/processed/

Depersonalization for Public Framework

Integration with Other Skills

When to use /extract vs. specialized skills

Skill Composition

Error Handling

Examples

Example 1: Peer Review Extraction

Example 2: Email Archive

Example 3: Auto-Detection

Validation Checklist

Future Enhancements

Extraction & Ingestion Skill

Framework Context

Search Before Creating

Purpose

Workflow Routing

1. Training Data Extraction

2. Archive Information Extraction

3. Decision Extraction

4. Document Knowledge Extraction

Workflow: Training Data Extraction

Input Types

Type A: Review with Inline Comments (DOCX with tracked changes)

Type B: Separate Review + Source Documents

Type C: Revision History

Extraction Process

Storage Rules

Quality Standards

Workflow: Archive Information Extraction

Workflow: Decision Extraction

Sensitive Data Handling

What is Sensitive?

Storage Location: $ACA_DATA/processed/

Depersonalization for Public Framework

Integration with Other Skills

When to use /extract vs. specialized skills

Skill Composition

Error Handling

Examples

Example 1: Peer Review Extraction

Example 2: Email Archive

Example 3: Auto-Detection

Validation Checklist

Future Enhancements

Storage Location: `$ACA_DATA/processed/`

When to use `/extract` vs. specialized skills

Storage Location: `$ACA_DATA/processed/`

When to use `/extract` vs. specialized skills