Skill

kordoc-korean-document-parser

Parses HWP, HWPX, and PDF Korean documents to Markdown and structured IRBlock data via CLI, programmatic API, or MCP server. Handles proprietary HWP binary formats, tables, form fields, document diffing, and reverse Markdown→HWPX generation.

TypeScript

Node

developer-tools

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/aradotso-trending-skills-37:kordoc-korean-document-parser

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

SKILL.md

395 lines · ~2.5k tokens

Stats

Stars57

Forks12

MaintenanceExcellent

Last CommitMay 5, 2026

Actions

View Source View Plugin View on GitHub View README

kordoc Korean Document Parser

Skill by ara.so — Daily 2026 Skills collection.

kordoc is a TypeScript library and CLI for parsing Korean government documents (HWP 5.x, HWPX, PDF) into Markdown and structured IRBlock[] data. It handles proprietary HWP binary formats, table extraction, form field recognition, document diffing, and reverse Markdown→HWPX generation.

Installation

# Core library
npm install kordoc

# PDF support (optional peer dependency)
npm install pdfjs-dist

# CLI (no install needed)
npx kordoc document.hwpx

Core API

Auto-detect and Parse Any Document

import { parse } from "kordoc"
import { readFileSync } from "fs"

const buffer = readFileSync("document.hwpx")
const result = await parse(buffer.buffer) // ArrayBuffer required

if (result.success) {
  console.log(result.markdown)   // string: full Markdown
  console.log(result.blocks)     // IRBlock[]: structured data
  console.log(result.metadata)   // { title, author, createdAt, pageCount, ... }
  console.log(result.outline)    // OutlineItem[]: document structure
  console.log(result.warnings)   // ParseWarning[]: skipped elements
} else {
  console.error(result.error)    // string message
  console.error(result.code)     // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
}

Format-Specific Parsers

import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc"

// Detect format first
const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown"

// Parse by format
const hwpxResult = await parseHwpx(buffer.buffer)
const hwpResult  = await parseHwp(buffer.buffer)
const pdfResult  = await parsePdf(buffer.buffer)

Parse Options

import { parse, ParseOptions } from "kordoc"

const result = await parse(buffer.buffer, {
  pages: "1-3",          // page range string
  // pages: [1, 5, 10], // or specific page numbers
  ocr: async (pageImage, pageNumber, mimeType) => {
    // Pluggable OCR for image-based PDFs
    // pageImage: ArrayBuffer of the page image
    return await myOcrService.recognize(pageImage)
  }
})

Working with IRBlocks

import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc"

// IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator"
for (const block of result.blocks) {
  if (block.type === "heading") {
    console.log(`H${block.level}: ${block.text}`)
    console.log(block.bbox)       // { x, y, width, height, page }
  }

  if (block.type === "table") {
    const table = block as IRTable
    for (const row of table.rows) {
      for (const cell of row) {
        console.log(cell.text, cell.colspan, cell.rowspan)
      }
    }
  }

  if (block.type === "paragraph") {
    console.log(block.text)
    console.log(block.style)      // InlineStyle: { bold, italic, fontSize, ... }
    console.log(block.pageNumber)
  }
}

Convert Blocks Back to Markdown

import { blocksToMarkdown } from "kordoc"

const markdown = blocksToMarkdown(result.blocks)

Document Comparison

import { compare } from "kordoc"

const bufA = readFileSync("v1.hwp").buffer
const bufB = readFileSync("v2.hwpx").buffer  // cross-format supported

const diff = await compare(bufA, bufB)

console.log(diff.stats)
// { added: 3, removed: 1, modified: 5, unchanged: 42 }

for (const d of diff.diffs) {
  // d.type: "added" | "removed" | "modified" | "unchanged"
  // d.blockA, d.blockB: IRBlock
  // d.cellDiffs: CellDiff[] for table blocks
  console.log(d.type, d.blockA?.text ?? d.blockB?.text)
}

Form Field Extraction

import { parse, extractFormFields } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const form = extractFormFields(result.blocks)

  console.log(form.confidence)  // 0.0–1.0
  for (const field of form.fields) {
    // { label: "성명", value: "홍길동", row: 0, col: 0 }
    console.log(`${field.label}: ${field.value}`)
  }
}

Markdown → HWPX Generation

import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"

const markdown = `
# 제목

본문 내용입니다.

| 구분 | 내용 |
| --- | --- |
| 항목1 | 값1 |
| 항목2 | 값2 |
`

const hwpxBuffer = await markdownToHwpx(markdown)
writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))

CLI Usage

# Basic conversion — output to stdout
npx kordoc document.hwpx

# Save to file
npx kordoc document.hwp -o output.md

# Batch convert all PDFs to a directory
npx kordoc *.pdf -d ./converted/

# JSON output with blocks + metadata
npx kordoc report.hwpx --format json

# Parse specific pages only
npx kordoc report.hwpx --pages 1-3

# Watch mode — auto-convert new files
npx kordoc watch ./incoming -d ./output

# Watch with webhook notification on conversion
npx kordoc watch ./docs --webhook https://api.example.com/hook

MCP Server Setup

Add to your MCP config (Claude Desktop, Cursor, Windsurf):

{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}

Available MCP Tools

Tool	Description
`parse_document`	Parse HWP/HWPX/PDF → Markdown + metadata + outline + warnings
`detect_format`	Detect file format via magic bytes
`parse_metadata`	Extract only metadata (fast, no full parse)
`parse_pages`	Parse a specific page range
`parse_table`	Extract the Nth table from a document
`compare_documents`	Diff two documents (cross-format supported)
`parse_form`	Extract form fields as structured JSON

TypeScript Types Reference

import type {
  // Results
  ParseResult, ParseSuccess, ParseFailure,
  ErrorCode,        // "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...

  // Blocks
  IRBlock, IRBlockType, IRTable, IRCell, CellContext,

  // Metadata & structure
  DocumentMetadata, OutlineItem,
  ParseWarning, WarningCode,
  BoundingBox,      // { x, y, width, height, page }
  InlineStyle,      // { bold, italic, fontSize, color, ... }

  // Options
  ParseOptions, FileType,
  OcrProvider,      // async (image, pageNum, mime) => string
  WatchOptions,

  // Diff
  DiffResult, BlockDiff, CellDiff, DiffChangeType,

  // Forms
  FormField, FormResult,
} from "kordoc"

Common Patterns

Batch Process Files with Error Handling

import { parse, detectFormat } from "kordoc"
import { readFileSync } from "fs"
import { glob } from "glob"

const files = await glob("./docs/**/*.{hwp,hwpx,pdf}")

for (const file of files) {
  const buffer = readFileSync(file)
  const fmt = detectFormat(buffer.buffer)

  if (fmt === "unknown") {
    console.warn(`Skipping unknown format: ${file}`)
    continue
  }

  const result = await parse(buffer.buffer)

  if (!result.success) {
    if (result.code === "ENCRYPTED") {
      console.warn(`Encrypted, skipping: ${file}`)
    } else if (result.code === "IMAGE_BASED_PDF") {
      console.warn(`Image-based PDF needs OCR: ${file}`)
    } else {
      console.error(`Failed: ${file} — ${result.error}`)
    }
    continue
  }

  console.log(`Parsed ${file}: ${result.blocks.length} blocks`)
}

Extract All Tables from a Document

import { parse } from "kordoc"
import type { IRTable } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const tables = result.blocks.filter(b => b.type === "table") as IRTable[]

  tables.forEach((table, i) => {
    console.log(`\n--- Table ${i + 1} ---`)
    for (const row of table.rows) {
      const cells = row.map(cell => cell.text.trim()).join(" | ")
      console.log(`| ${cells} |`)
    }
  })
}

OCR with Tesseract.js

import { parse } from "kordoc"
import Tesseract from "tesseract.js"

const result = await parse(buffer.buffer, {
  ocr: async (pageImage, pageNumber, mimeType) => {
    const blob = new Blob([pageImage], { type: mimeType })
    const url = URL.createObjectURL(blob)
    const { data } = await Tesseract.recognize(url, "kor+eng")
    URL.revokeObjectURL(url)
    return data.text
  }
})

Watch Mode Programmatic API

import { watch } from "kordoc"

const watcher = watch("./incoming", {
  output: "./converted",
  webhook: process.env.WEBHOOK_URL,
  onFile: async (file, result) => {
    if (result.success) {
      console.log(`Converted: ${file}`)
    }
  }
})

// Stop watching
watcher.stop()

Troubleshooting

buffer.buffer vs Buffer — kordoc requires ArrayBuffer, not Node.js Buffer. Always pass readFileSync("file").buffer or use .buffer on a Uint8Array.

PDF tables not detected — Line-based detection requires pdfjs-dist installed. Install it: npm install pdfjs-dist. For borderless tables, kordoc uses cluster-based heuristics automatically.

"IMAGE_BASED_PDF" error — The PDF contains scanned images with no text layer. Provide an ocr function in parse options.

"ENCRYPTED" error — HWP DRM/password-protected files cannot be parsed without the decryption key. No workaround.

Korean characters garbled in output — Ensure your terminal/file uses UTF-8 encoding. kordoc outputs UTF-8 Markdown by default.

Large files are slow — Use pages option to parse only needed pages: parse(buf, { pages: "1-5" }). Metadata-only extraction is faster: parse_metadata MCP tool or check result.metadata directly.

HWP table columns wrong — Update to v1.6.1+. Earlier versions had a 2-byte offset misalignment in LIST_HEADER parsing causing column explosion.

kordoc-korean-document-parser

Popularity

Invocation

Context Preview

SKILL.md

kordoc-korean-document-parser

Popularity

Invocation

Context Preview

SKILL.md

kordoc Korean Document Parser

Installation

Core API

Auto-detect and Parse Any Document

Format-Specific Parsers

Parse Options

Working with IRBlocks

Convert Blocks Back to Markdown

Document Comparison

Form Field Extraction

Markdown → HWPX Generation

CLI Usage

MCP Server Setup

Available MCP Tools

TypeScript Types Reference

Common Patterns

Batch Process Files with Error Handling

Extract All Tables from a Document

OCR with Tesseract.js

Watch Mode Programmatic API

Troubleshooting

Similar Skills

kordoc Korean Document Parser

Installation

Core API

Auto-detect and Parse Any Document

Format-Specific Parsers

Parse Options

Working with IRBlocks

Convert Blocks Back to Markdown

Document Comparison

Form Field Extraction

Markdown → HWPX Generation

CLI Usage

MCP Server Setup

Available MCP Tools

TypeScript Types Reference

Common Patterns

Batch Process Files with Error Handling

Extract All Tables from a Document

OCR with Tesseract.js

Watch Mode Programmatic API

Troubleshooting

Similar Skills