Skill

post-ocr-cleanup

Guides post-OCR text cleanup for research corpora using LLM correction, rule-based fixes, quality diagnostics, and multilingual handling. For strategy selection, prompt design, evaluation, and assurance.

ai-ml

data-engineering

npx claudepluginhub scdenney/open-science-skills --plugin open-science-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

- **Choose between LLM correction, rule-based fixes, or a hybrid pipeline based on error type.** LLM correction excels at context-dependent errors (wrong but plausible characters, broken words, missing diacritics). Rule-based fixes handle deterministic patterns (control characters, Unicode normalization, repetition artifacts, whitespace) with zero risk of content alteration. Use rule-based fixe...

SKILL.md

Similar Skills

vlm-ocr-pipeline

Guides setup/execution of VLM-based OCR pipelines for scanned historical/multilingual documents: model selection, image/DPI handling, prompts, batch processing on HPC/SLURM, accuracy eval with CER/WER.

open-science-skills

ppw:polish

121

Polishes English academic LaTeX text for journal submission via quick-fix or guided multi-pass workflows. Supports in-place editing, change tracking, and journal style adaptation.

paper-polish-workflow

humanizer

Detects AI-generated patterns in Korean text from LLMs like ChatGPT/Claude/Gemini and rewrites to natural human style. Analyzes 40 linguistic markers (comma overuse, spacing rigidity, POS diversity) with S1-S3 severity based on KatFishNet research (94.88% AUC).

14 files1 tool

korean-skills

Stats

Parent Repo Stars13

Parent Repo Forks0

Last CommitApr 5, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Post-OCR Text Cleanup for Research Corpora

Instructions

1. Cleanup Strategy Selection

Choose between LLM correction, rule-based fixes, or a hybrid pipeline based on error type. LLM correction excels at context-dependent errors (wrong but plausible characters, broken words, missing diacritics). Rule-based fixes handle deterministic patterns (control characters, Unicode normalization, repetition artifacts, whitespace) with zero risk of content alteration. Use rule-based fixes unconditionally for these categories.
Default to the hybrid approach for research corpora. Run LLM correction first on all pages, then apply deterministic rule fixes on top. This order matters: LLM correction may introduce formatting artifacts that rule fixes clean up, while the reverse order wastes rule-fix effort on text the LLM will rewrite (Machidon & Machidon 2025).
Pilot-test LLM correction per language before corpus-wide deployment. LLM post-correction effectiveness is highly language-dependent: English achieves 7-58% CER reduction, while some languages see no improvement or degradation (Kanerva et al. 2025). Never assume cross-language transferability.
Consider whether correction is needed at all. If the downstream analysis tolerates OCR noise (e.g., topic modeling is robust to moderate error rates), the risk of correction-introduced errors may outweigh the benefit. Define the quality threshold before choosing a strategy.

2. LLM-Based Correction

Use a small text-only model, not the full VLM. The correction input is already text; image understanding is not needed. A 7-8B parameter model with 4-bit quantization requires only ~4GB VRAM and processes pages in 2-3 seconds (Thomas et al. 2024).
Write tight correction prompts. Instruct the model to "fix clear OCR mistakes only: wrong characters, broken words, garbled punctuation, repetition artifacts. Do not translate, modernize, or add anything. Output the corrected text only." Loose prompts invite hallucination.
Add language-specific instructions. For Polish, explicitly mention diacritics restoration (ą, ć, ę, ł, ń, ó, ś, ź, ż). For Korean, mention hangul integrity and hanja preservation. The correction model needs to know which character set to favor.
Mitigate hallucination with constrained decoding. Constrained decoding techniques (logit biasing, grammar-constrained generation) that enforce character-level similarity between input and output prevent the model from inserting plausible but fabricated content (Sastre et al. 2025). At minimum, track and flag pages where correction diverges substantially from the original.
Disable chain-of-thought for correction tasks. Reasoning modes add latency without improving transcription fidelity. Use greedy decoding (do_sample=False) for deterministic output.
Consider multimodal correction for severely degraded pages. Feeding both the original page image and the OCR text to the correction model can achieve below 1% CER on degraded documents, but doubles GPU cost (Greif et al. 2025). Reserve this for flagged pages, not routine processing.
Track all changes with edit-distance metrics. Compute Levenshtein distance and change ratio (edit distance / original length) per page. Flag pages where the correction model altered more than 10% of characters for manual review — high change ratios may indicate hallucination rather than correction (Bourne 2024).

3. Rule-Based Fixes

Apply deterministic fixes in a fixed order. (1) Control character removal, (2) zero-width and invisible Unicode character removal, (3) NFKC Unicode normalization, (4) consecutive character repetition collapse, (5) standalone symbol line removal, (6) whitespace normalization. This ordering prevents interactions between fixes.
Tune repetition collapse thresholds to the corpus. The default of collapsing runs of 4+ identical characters to 3 works for most scripts but may need adjustment for languages with legitimate long character sequences or for documents with intentional formatting patterns.
Rule-based diacritics restoration is viable for some languages. For Polish, rule-based approaches (removing word breaks, rejecting case-changing corrections, restoring diacritical characters replaced with visually similar ASCII) are competitive with LLM-based correction and more predictable (Ogrodniczuk 2022).
Generate synthetic OCR errors for training when ground truth is scarce. Character-level Markov corruption can produce realistic synthetic training data for correction models without requiring hand-annotated ground truth (Guan & Greene 2024). Under-corrupted training data outperforms over-corrupted.
Preserve the raw text alongside every cleaned version. Rule-based fixes are deterministic and reversible, but downstream researchers may prefer different normalization choices. Store both raw and cleaned text at every stage.

4. Quality Diagnostics and Metrics

Move beyond CER/WER as the sole quality measure. Character-level edit distance is sensitive to normalization choices, does not capture semantic correctness, and can disagree across evaluation tools (Beyene & Dancy 2026). A model can scramble column order while achieving perfect CER.
Build a multi-signal diagnostic profile per page. Character composition ratios (diacritics-to-Latin for European scripts, CJK character ratios for East Asian), repetition artifact density, symbol density, and page length anomalies (empty, suspiciously short, suspiciously long) each capture different failure modes.
Use dictionary hit rates as an automated quality proxy. Tokenize OCR output and check against morphological dictionaries or analyzers. Compute per-page and per-document valid-token rates. This scales to every page in the corpus without human effort.
Calibrate thresholds against downstream task requirements. OCR quality directly impacts NER, classification, topic modeling, and other downstream NLP tasks (van Strien et al. 2020). Define acceptable error rates based on what your analysis pipeline can tolerate, not abstract accuracy targets. Researchers in computational humanities should evaluate OCR quality relative to their analytical needs (Backer & Hyman 2025).
Classify pages by recommended action. Map diagnostic signals to action categories: OK (no intervention), rule-fixable (deterministic cleanup sufficient), LLM-fixable (context-dependent errors), manual review (critical failures). This prioritizes human attention on the pages that need it most.

5. Multilingual Considerations

Check diacritic ratios for Latin-script languages. For Polish, a page of body text with zero diacritical characters almost certainly has OCR errors. Flag pages where the diacritic-to-alphabetic ratio drops below a language-calibrated threshold (Ogrodniczuk 2022).
Treat Korean spacing as a distinct post-OCR task. Korean uses space-delimited eojeol units, and OCR frequently merges or splits them incorrectly. Dedicated spacing models (Choi & Kim 2021) outperform general-purpose LLM correction for this specific error type.
Use morphological analyzers as correction validators. Morfeusz for Polish, Mecab-ko for Korean — tokens that parse successfully are likely correct; tokens that fail to parse are OCR error candidates. This provides both a diagnostic signal and a correction filter.
Protect dialectal and archaic text from normalization. Correction models trained on standard modern language may damage historical orthography, dialectal spellings, or archaic vocabulary. Test on a sample of the oldest and most linguistically distinctive documents before corpus-wide deployment (Kanerva et al. 2025; Machidon & Machidon 2025).

6. Corpus-Level Quality Assurance

Implement a three-tier review workflow. (1) Automated pass/fail based on diagnostic thresholds applied to all documents, (2) spot-check review of flagged documents (10-15% of corpus), (3) deep review of a random sample of passing documents (2-5% of corpus) to catch false negatives.
Define specific flagging thresholds. Mean quality score below 0.80, LLM correction change ratio above 10%, dictionary hit rate below 80%, or more than 5% empty pages. Calibrate these thresholds against your pilot evaluation results.
Route flagged documents to specific remediation actions. Re-run LLM cleanup with a different prompt, re-OCR at higher DPI or without quantization, or escalate to manual transcription. Each action has different cost and quality implications.
Produce a corpus-level quality dashboard. Aggregate per-document metrics into a summary CSV or report: document ID, language, page count, mean quality score, dictionary hit rate, correction change ratio, flag status, review outcome. This supports both operational monitoring and methods reporting.

7. Provenance and Documentation

Maintain a complete audit trail of all corrections. Each correction should be attributed to its source (LLM model version, rule name) with before/after text preserved at the page level (Guo & Wei 2026). Correction pathways can substantially alter extracted entities and downstream interpretations.
Log model details for the LLM correction stage. Model name, quantization method, prompt text, generation parameters (temperature, max tokens, sampling), and per-page edit-distance metrics. This is the minimum required to reproduce or audit the correction.
Log rule-fix details. Which rules fired and how many characters each rule changed per page. This enables downstream researchers to assess whether rule fixes were aggressive or conservative for a given document.
Produce a cleanup comparison artifact. A CSV or JSON with page-level before/after text pairs and metrics (edit distance, change ratio, flagged status) enables downstream researchers to assess correction quality and choose their preferred text version.
Record all thresholds and human review outcomes. Sampling decisions, flagging thresholds, remediation actions, and human review results should be documented in a methods section alongside the corpus, not just in internal notes.

post-ocr-cleanup

Tool Access

Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

post-ocr-cleanup

Tool Access

Preview

SKILL.md

Post-OCR Text Cleanup for Research Corpora

Instructions

1. Cleanup Strategy Selection

2. LLM-Based Correction

3. Rule-Based Fixes

4. Quality Diagnostics and Metrics

5. Multilingual Considerations

6. Corpus-Level Quality Assurance

7. Provenance and Documentation

Quality Checks

Similar Skills

Help us improve

Post-OCR Text Cleanup for Research Corpora

Instructions

1. Cleanup Strategy Selection

2. LLM-Based Correction

3. Rule-Based Fixes

4. Quality Diagnostics and Metrics

5. Multilingual Considerations

6. Corpus-Level Quality Assurance

7. Provenance and Documentation

Quality Checks