document-extraction | requirements-elicitation | ClaudePluginHub

Extracts and categorizes requirements from PDFs, Word docs, transcripts, specs, and web content using pattern matching and outputs structured YAML.

$ npx claudepluginhub melodic-software/claude-code-plugins --plugin requirements-elicitation

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, Write, Task, WebFetch

Supporting Assets

references/document-types.md
references/extraction-prompts.md

SKILL.md

Stats

Parent Repo Stars: 40

Parent Repo Forks: 6

Last Commit: Dec 27, 2025

Tags

requirement-extraction

document-mining

pdf-requirements

transcript-analysis

requirements-categorization

Document Extraction Skill

Extract requirements from existing documentation sources for systematic requirement mining.

When to Use This Skill

Keywords: extract requirements, document mining, PDF requirements, transcript analysis, parse document, existing documentation, legacy requirements, competitive analysis

Invoke this skill when:

Mining requirements from existing documents
Processing meeting transcripts for requirements
Extracting requirements from competitor products
Analyzing regulatory documents for compliance requirements
Converting legacy documentation to structured requirements

Supported Document Types

| Type | Extension | Extraction Method |
|------|-----------|-------------------|
| Markdown | .md | Direct Read |
| Text | .txt | Direct Read |
| PDF | .pdf | Read tool (PDF support) |
| Word | .docx | Read tool |
| Web Page | URL | WebFetch tool |
| Meeting Notes | .md, .txt | Transcript patterns |
| Specification | .md, .docx | Requirement patterns |

Extraction Workflow

Step 1: Document Assessment

Analyze the document to determine extraction strategy:

document_assessment:
  path: "{file path or URL}"
  type: "{detected document type}"
  size: "{approximate size}"
  structure:
    has_sections: true|false
    has_lists: true|false
    has_tables: true|false
  quality:
    formal_language: true|false
    clear_requirements: true|false
    needs_interpretation: true|false
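
The assessment above could be approximated programmatically. The following is a minimal sketch, not the skill's actual implementation: `assess_document`, `EXTENSION_TYPES`, and the structural heuristics (Markdown-style headings, bullets, and pipe tables) are all assumptions made for illustration. The `clear_requirements` and `needs_interpretation` fields are left out because they need human or model judgment.

```python
from pathlib import PurePosixPath

# Extension mapping taken from the Supported Document Types table.
# The names below are hypothetical, chosen only for this sketch.
EXTENSION_TYPES = {
    ".md": "markdown",
    ".txt": "text",
    ".pdf": "pdf",
    ".docx": "word",
}


def assess_document(path: str, text: str) -> dict:
    """Build a document_assessment-style record from a path and its text."""
    if path.startswith(("http://", "https://")):
        doc_type = "web_page"
    else:
        suffix = PurePosixPath(path).suffix.lower()
        doc_type = EXTENSION_TYPES.get(suffix, "unknown")

    lines = text.splitlines()
    lowered = text.lower()
    return {
        "path": path,
        "type": doc_type,
        "size": f"{len(text)} chars",
        "structure": {
            # Heuristics assume Markdown-like source text.
            "has_sections": any(line.startswith("#") for line in lines),
            "has_lists": any(line.lstrip().startswith(("- ", "* ")) for line in lines),
            "has_tables": any(line.count("|") >= 2 for line in lines),
        },
        "quality": {
            # Crude proxy: formal specifications tend to use "shall" or "must".
            "formal_language": " shall " in lowered or " must " in lowered,
        },
    }
```

For PDF or Word input, the `text` argument would be whatever text the Read tool returns; the sketch does not parse binary formats itself.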

Step 2: Pattern Matching

Apply requirement detection patterns:

Explicit Requirement Markers:

- "The system shall..."
- "The system must..."
- "Users should be able to..."
- "REQ-XXX:"
- Numbered requirements (1.1, 1.2, etc.)

EARS Patterns:

- "When [trigger], the [system] shall [response]"
- "While [state], the [system] shall [behavior]"
- "Where [feature], the [system] shall [behavior]"
- "If [condition], then the [system] shall [response]"

Implicit Requirement Indicators:

- "It is important that..."
- "We need..."
- "The goal is to..."
- "Users expect..."
- "Performance should..."
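
One way to apply these three pattern families is with regular expressions. The sketch below is illustrative only: the regexes are approximations of the markers listed above, and `classify_sentence` and the returned labels are hypothetical names, not part of the skill.

```python
import re
from typing import Optional

# Approximations of the explicit markers: "shall", "must",
# "should be able to", "REQ-XXX:" prefixes, and numbered items like "1.2 ".
EXPLICIT = re.compile(
    r"\b(?:shall|must)\b|\bshould be able to\b|^\s*REQ-\w+:|^\s*\d+(?:\.\d+)+\s",
    re.IGNORECASE,
)
# EARS templates: a When/While/Where/If clause, a comma, then "the <system> shall".
EARS = re.compile(
    r"^\s*(?:when|while|where|if)\b.+,\s*(?:then\s+)?the\s+.+?\bshall\b",
    re.IGNORECASE,
)
# Implicit indicators from the list above.
IMPLICIT = re.compile(
    r"\b(?:it is important that|we need|the goal is to|users expect|performance should)\b",
    re.IGNORECASE,
)


def classify_sentence(sentence: str) -> Optional[str]:
    """Return 'ears', 'explicit', 'implicit', or None for one sentence."""
    # EARS is checked first because every EARS sentence also contains "shall".
    if EARS.search(sentence):
        return "ears"
    if EXPLICIT.search(sentence):
        return "explicit"
    if IMPLICIT.search(sentence):
        return "implicit"
    return None
```

Checking the most specific pattern first keeps an EARS sentence from being reported as a generic explicit match.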

Step 3: Requirement Extraction

For each identified requirement:

extracted_requirement:
  id: REQ-{sequence}
  text: "{cleaned requirement statement}"
  source: document
  source_file: "{file path}"
  source_location: "{section/page/line}"
  original_text: "{exact text from document}"
  type: functional|non-functional|constraint|assumption
  confidence: high|medium|low
  extraction_method: explicit|pattern|inferred
  needs_review: true|false
  review_notes: "{why review needed}"
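
The record above maps naturally onto a small data class. This is a sketch under assumptions: `ExtractedRequirement` and `make_requirement` are hypothetical names, the three-digit ID padding follows the `REQ-001` style used elsewhere in this document, and flagging everything below high confidence for review is an assumed policy.

```python
from dataclasses import dataclass, asdict


@dataclass
class ExtractedRequirement:
    id: str
    text: str
    source_file: str
    source_location: str
    original_text: str
    type: str               # functional | non-functional | constraint | assumption
    confidence: str         # high | medium | low
    extraction_method: str  # explicit | pattern | inferred
    needs_review: bool
    review_notes: str = ""
    source: str = "document"


def make_requirement(seq, original, source_file, location, method,
                     confidence, req_type="functional"):
    """Create a record, keeping the exact source text alongside a cleaned copy."""
    # Assumed policy: anything below high confidence is flagged for review.
    needs_review = confidence != "high"
    return ExtractedRequirement(
        id=f"REQ-{seq:03d}",
        text=" ".join(original.split()),  # collapse whitespace and line breaks
        source_file=source_file,
        source_location=location,
        original_text=original,
        type=req_type,
        confidence=confidence,
        extraction_method=method,
        needs_review=needs_review,
        review_notes=f"{confidence} confidence ({method})" if needs_review else "",
    )
```

`asdict(requirement)` yields a plain dictionary that can be handed to any YAML serializer.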

Step 4: Categorization

Categorize extracted requirements:

categories:
  functional:
    - features
    - behaviors
    - interactions
  non_functional:
    - performance
    - security
    - usability
    - reliability
    - scalability
  constraints:
    - technical
    - business
    - regulatory
  assumptions:
    - environmental
    - user_behavior
    - dependencies
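
A simple keyword heuristic can give a first-pass category before human or model review. The hint lists below are invented for illustration and are far from complete, and `categorize` is a hypothetical helper. The sketch does not try to detect assumptions, which usually need context rather than keywords.

```python
# Keyword hints per subcategory; all lists are illustrative assumptions.
CONSTRAINT_HINTS = {
    "regulatory": ("gdpr", "hipaa", "comply", "compliance"),
    "technical": ("must use", "compatible with", "runs on"),
    "business": ("budget", "deadline", "license"),
}
NON_FUNCTIONAL_HINTS = {
    "performance": ("latency", "response time", "throughput"),
    "security": ("encrypt", "authenticat", "authoriz", "password"),
    "usability": ("accessible", "intuitive", "easy to"),
    "reliability": ("uptime", "availability", "failover", "recover"),
    "scalability": ("concurrent", "scalab", "peak load"),
}


def categorize(text: str) -> tuple:
    """Return (category, subcategory) for a requirement statement."""
    lowered = text.lower()
    # Constraints are checked first so a regulatory clause that mentions
    # encryption is still filed as a constraint.
    for subcategory, hints in CONSTRAINT_HINTS.items():
        if any(hint in lowered for hint in hints):
            return ("constraints", subcategory)
    for subcategory, hints in NON_FUNCTIONAL_HINTS.items():
        if any(hint in lowered for hint in hints):
            return ("non_functional", subcategory)
    # Default bucket when no hint matches.
    return ("functional", "features")
```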

Step 5: Deduplication

Identify and merge duplicate requirements:

deduplication:
  strategy: semantic_similarity
  threshold: 0.8
  action: merge|flag_for_review
  merged_requirements:
    - id: REQ-merged-001
      sources: [REQ-001, REQ-015]
      text: "{consolidated requirement}"
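
The configuration above names semantic similarity as the strategy. The sketch below swaps in lexical similarity from Python's standard `difflib` instead, purely to show the threshold-and-merge mechanics; it will miss paraphrases that a true semantic comparison would catch. `deduplicate` and the `sources` bookkeeping are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(requirements, threshold=0.8):
    """Group near-duplicate requirements.

    requirements: list of dicts with 'id' and 'text'.
    Returns (kept, merged), where merged is the subset of kept
    that absorbed at least one duplicate.
    """
    kept = []
    for req in requirements:
        match = next(
            (k for k in kept if similarity(k["text"], req["text"]) >= threshold),
            None,
        )
        if match is None:
            kept.append({**req, "sources": [req["id"]]})
        else:
            # Record the duplicate's ID on the first-seen requirement.
            match["sources"].append(req["id"])
    merged = [k for k in kept if len(k["sources"]) > 1]
    return kept, merged
```

Keeping the first-seen text and accumulating source IDs mirrors the `sources: [REQ-001, REQ-015]` shape in the template; writing a consolidated statement is left to review.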

Document-Specific Strategies

Meeting Transcripts

transcript_extraction:
  focus_on:
    - Action items
    - Decisions made
    - Requirements discussed
    - Concerns raised
  patterns:
    - "We decided that..."
    - "The requirement is..."
    - "Action item:"
    - "TODO:"
    - "Need to..."
  speaker_context:
    - Note who said what
    - Weight by speaker role
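
As a rough illustration of the transcript strategy, the sketch below scans `Speaker: utterance` lines for the listed patterns and attaches a weight based on the speaker's role. The line format, the `roles` mapping, and the numeric weights are all assumptions; real transcripts vary widely.

```python
import re

SPEAKER_LINE = re.compile(r"^(?P<speaker>[^:]{1,40}):\s*(?P<utterance>.+)$")
TRANSCRIPT_PATTERNS = re.compile(
    r"we decided that|the requirement is|action item:|todo:|need to",
    re.IGNORECASE,
)
# Hypothetical role weights; tune these per project.
ROLE_WEIGHTS = {"product_owner": 1.0, "architect": 0.8, "developer": 0.6}
DEFAULT_WEIGHT = 0.5


def extract_from_transcript(transcript: str, roles: dict) -> list:
    """Return candidate statements with speaker, line number, and weight.

    roles maps known speaker names to role names.
    """
    results = []
    for line_no, raw in enumerate(transcript.splitlines(), start=1):
        line = raw.strip()
        if not TRANSCRIPT_PATTERNS.search(line):
            continue
        match = SPEAKER_LINE.match(line)
        # Only treat the prefix as a speaker if it is a known name,
        # so "TODO: ..." is not mistaken for a person called TODO.
        if match and match["speaker"].strip() in roles:
            speaker, text = match["speaker"].strip(), match["utterance"]
        else:
            speaker, text = None, line
        role = roles.get(speaker, "unknown")
        results.append({
            "line": line_no,
            "speaker": speaker,
            "text": text,
            "weight": ROLE_WEIGHTS.get(role, DEFAULT_WEIGHT),
        })
    return results
```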

Regulatory Documents

regulatory_extraction:
  focus_on:
    - Mandatory requirements ("shall", "must")
    - Prohibited actions ("shall not", "must not")
    - Conditional requirements ("if...then")
  compliance_mapping:
    - Reference section numbers
    - Note effective dates
    - Track version/revision
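
The obligation types above can be told apart with a few regular expressions. This is a minimal sketch: `classify_clause` is a hypothetical helper, and the section-reference pattern assumes citations written as `Section 4.2` or `§ 4.2`, which not every regulation uses.

```python
import re

PROHIBITED = re.compile(r"\b(?:shall|must)\s+not\b", re.IGNORECASE)
MANDATORY = re.compile(r"\b(?:shall|must)\b", re.IGNORECASE)
CONDITIONAL = re.compile(r"\bif\b.+\bthen\b", re.IGNORECASE)
# Assumed citation style: "Section 4.2" or "§ 4.2".
SECTION_REF = re.compile(r"(?:Section|§)\s*(\d+(?:\.\d+)*)")


def classify_clause(clause: str) -> dict:
    """Label a clause as mandatory or prohibited and note conditions."""
    # "shall not" must be tested before "shall", which it also contains.
    if PROHIBITED.search(clause):
        obligation = "prohibited"
    elif MANDATORY.search(clause):
        obligation = "mandatory"
    else:
        obligation = None
    section = SECTION_REF.search(clause)
    return {
        "obligation": obligation,
        "conditional": bool(CONDITIONAL.search(clause)),
        "section": section.group(1) if section else None,
    }
```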

Competitor Analysis

competitor_extraction:
  focus_on:
    - Feature descriptions
    - User capabilities
    - Unique selling points
  output:
    - Feature requirements
    - Differentiation opportunities
    - Gap identification
  confidence: low  # Based on external observation

Legacy Specifications

legacy_extraction:
  focus_on:
    - Existing requirements
    - System behaviors
    - Integration points
  modernization:
    - Update terminology
    - Convert to EARS format
    - Flag deprecated requirements

Output Format

Per-Document Output

extraction_result:
  source:
    file: "{path or URL}"
    type: "{document type}"
    extraction_date: "{ISO-8601}"
    confidence: high|medium|low

  statistics:
    total_candidates: {number}
    extracted: {number}
    filtered: {number}
    needs_review: {number}

  requirements:
    - id: REQ-{number}
      text: "{requirement}"
      type: functional|non-functional|constraint|assumption
      source_location: "{section/page}"
      confidence: high|medium|low
      original_text: "{exact source text}"

  review_items:
    - requirement_id: REQ-{number}
      reason: "{why review needed}"
      suggestion: "{proposed action}"

  metadata:
    sections_processed: {number}
    extraction_patterns_used: ["{pattern names}"]
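
The statistics block can be derived from the candidate list rather than filled in by hand. The sketch below assumes each candidate is a dictionary carrying `id`, `needs_review`, an optional `review_notes`, and a hypothetical `filtered` flag for rejected candidates; none of these conventions are confirmed by the skill itself.

```python
from datetime import datetime, timezone


def build_extraction_result(source_file: str, doc_type: str, candidates: list) -> dict:
    """Assemble a per-document result with derived statistics."""
    # 'filtered' is an assumed flag marking candidates rejected during extraction.
    extracted = [c for c in candidates if not c.get("filtered", False)]
    review = [c for c in extracted if c.get("needs_review")]
    return {
        "extraction_result": {
            "source": {
                "file": source_file,
                "type": doc_type,
                "extraction_date": datetime.now(timezone.utc).isoformat(),
            },
            "statistics": {
                "total_candidates": len(candidates),
                "extracted": len(extracted),
                "filtered": len(candidates) - len(extracted),
                "needs_review": len(review),
            },
            "requirements": extracted,
            "review_items": [
                {"requirement_id": c["id"], "reason": c.get("review_notes", "")}
                for c in review
            ],
        }
    }
```

The resulting dictionary uses only plain types, so it can be written out with any YAML library.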

Autonomy Levels

Guided Mode

guided_behavior:
  document_selection: Human selects
  extraction_strategy: AI suggests, human approves
  each_requirement: AI highlights, human confirms
  categorization: AI suggests, human validates

Semi-Autonomous Mode

semi_auto_behavior:
  document_selection: AI suggests priority, human approves list
  extraction_strategy: AI chooses autonomously
  requirements: AI extracts all, human reviews in batches
  categorization: AI categorizes, human spot-checks

Fully Autonomous Mode

full_auto_behavior:
  document_selection: AI processes all relevant
  extraction_strategy: AI optimizes per document
  requirements: AI extracts, deduplicates, categorizes
  output: Full extraction report for final review
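
One possible reading of the three modes is as a table of steps that block on human approval before the workflow continues. The encoding below is an interpretation, not the skill's definition: batch review and spot-checks in semi-autonomous mode are treated as after-the-fact review rather than blocking gates, and the mode and step names are hypothetical.

```python
# Steps that must wait for a human before proceeding, per mode.
# This table is an assumed interpretation of the mode descriptions.
APPROVAL_GATES = {
    "guided": {
        "document_selection",
        "extraction_strategy",
        "requirement",
        "categorization",
    },
    "semi_autonomous": {"document_selection"},
    "fully_autonomous": set(),
}


def needs_approval(mode: str, step: str) -> bool:
    """Return True if the given step blocks on human approval in this mode."""
    if mode not in APPROVAL_GATES:
        raise ValueError(f"unknown autonomy mode: {mode}")
    return step in APPROVAL_GATES[mode]
```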

Quality Indicators

High Confidence Extraction

Explicit requirement markers ("shall", "must")
EARS-pattern matches
Numbered requirement lists
Clear imperative statements

Medium Confidence Extraction

Implicit indicators ("should", "needs to")
Context-dependent interpretation
Partial pattern matches
Requires domain knowledge

Low Confidence Extraction

Inferred from descriptions
Narrative text interpretation
Competitive analysis
Assumptions based on context
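
These indicators could be turned into a first-pass confidence score. The sketch below covers only the keyword-based signals ("shall"/"must" for high, "should"/"needs to" for medium) plus a flag for competitive-analysis sources; `confidence_level` is a hypothetical helper, and the remaining indicators need judgment that a regex cannot supply.

```python
import re

HIGH_MARKERS = re.compile(r"\b(?:shall|must)\b", re.IGNORECASE)
MEDIUM_MARKERS = re.compile(r"\b(?:should|needs? to)\b", re.IGNORECASE)


def confidence_level(text: str, from_competitor: bool = False) -> str:
    """Return 'high', 'medium', or 'low' for an extracted statement."""
    # Competitive analysis is listed under low confidence regardless of wording.
    if from_competitor:
        return "low"
    if HIGH_MARKERS.search(text):
        return "high"
    if MEDIUM_MARKERS.search(text):
        return "medium"
    return "low"
```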

Delegation

For related tasks, delegate to:

gap-analysis: Check extracted requirements for completeness
domain-research: Research unfamiliar terms or concepts
elicitation-methodology: Route back for technique selection

Output Location

Save extraction results to:

.requirements/{domain}/documents/DOC-{filename}-{timestamp}.yaml
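
A small helper can fill in this path template. Two details are assumptions, since the template does not specify them: `{filename}` is taken to be the source file's name without its extension, and `{timestamp}` is rendered as a compact UTC stamp. `output_path` is a hypothetical name.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def output_path(domain: str, source_file: str, now: Optional[datetime] = None) -> str:
    """Build the output location for one extracted document."""
    now = now or datetime.now(timezone.utc)
    stem = PurePosixPath(source_file).stem        # assumed meaning of {filename}
    timestamp = now.strftime("%Y%m%dT%H%M%SZ")    # assumed timestamp format
    return f".requirements/{domain}/documents/DOC-{stem}-{timestamp}.yaml"
```

Passing `now` explicitly keeps the function deterministic in tests.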

Related

elicitation-methodology - Parent hub skill
gap-analysis - Post-extraction completeness checking
interview-conducting - Clarify extracted requirements with stakeholders