From claude-thai-skills
Use this skill for any task involving Thai text in code — word segmentation, Unicode normalization, sorting/collation, search indexing, romanization, truncation, or database/Elasticsearch configuration for Thai. Trigger whenever the user asks to: tokenize / segment Thai (no spaces between words), normalize Thai text (NFC vs decomposed), sort Thai strings correctly, transliterate to roman / Latin, build a Thai search index, fix Thai text that breaks length limits or renders broken glyphs, or pick a Thai NLP library. Also trigger for requests like "ตัดคำภาษาไทย", "Thai word segmentation", "Thai NLP", "ค้นหาภาษาไทย", "Thai romanization", "PyThaiNLP", "Thai sort", "ICU Thai", "Thai collation", "Thai search index", "Thai full text search", or any variation. If the task involves processing Thai text in software, use this skill.
npx claudepluginhub boom-vitt/claude-thai-skills --plugin thai-invoice
Slash command: /claude-thai-skills:thai-text-processing
Thai is written without spaces between words, uses combining tone marks, and does not sort by codepoint. Naive code that works for English silently corrupts Thai: `"ฉันกินข้าว".split(" ")` returns one token, `ORDER BY name` puts vowels in the wrong place, and `LEFT(name, 20)` may chop a syllable mid-character and render `◌`. This skill is the checklist of things to fix.
"ฉันกินข้าว".split(" ") # ["ฉันกินข้าว"] ← one mega-token
This breaks:
- Search: `ILIKE '%กิน%'` mostly works, `ILIKE '%ข้าว%'` mostly works, but ranking, stemming, and autocomplete all need proper tokens.

You must run a segmenter before any token-based operation.
| Tool | Language | Notes |
|---|---|---|
| PyThaiNLP word_tokenize | Python | De-facto default. Engines: newmm (default, dict-based), longest, attacut (CNN), deepcut (CNN, slow but accurate) |
| nlpO3 | Python via Rust | Fast newmm-style; permissive license |
| ICU BreakIterator | C / Java / Python (pyicu) | Built into many platforms; decent for word-break |
| thai-segmenter | TypeScript | Browser-friendly; dictionary-based |
| Lucene Thai analyzer | Java | Use for Elasticsearch / OpenSearch indexing |
| SentencePiece / BPE | Any | For ML training; not for human-facing tokens |
Picking rules of thumb:
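- Default choice: `newmm` (good enough, fast, MIT)
- Higher accuracy: `attacut` or `deepcut` (slower)
- Browser / TypeScript: `thai-segmenter`, or call ICU via WebAssembly
- Elasticsearch / OpenSearch: use the built-in Lucene analyzer (`thai` tokenizer); do not segment client-side and ship tokens to ES

See `examples/segmentation.py` for runnable code.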
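To make "dictionary-based" concrete, here is a minimal sketch of greedy longest matching, the idea behind the `longest` engine listed in the table. It is not PyThaiNLP's implementation: the function name `longest_match_segment` and the three-word dictionary are assumptions for illustration, and real engines use large dictionaries plus rules for unknown words.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: emit one code point as-is.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


toy_dictionary = {"ฉัน", "กิน", "ข้าว"}  # hypothetical toy dictionary
print(longest_match_segment("ฉันกินข้าว", toy_dictionary))
```

Greedy matching is fast but picks the wrong split when a long word hides two shorter ones, which is why `newmm` adds maximal matching over the whole sentence.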
Thai displays one "visual character" out of up to 3 codepoints: base consonant + above-vowel + tone mark. The same word can be encoded multiple ways:
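import unicodedata
# Thai has no precomposed letters, so NFD leaves a word like "แสง" unchanged.
# What does vary is the order of stacked marks (here: below-vowel ู and tone mark ่).
a = "ป\u0e39\u0e48"  # "ปู่": vowel first, then tone mark (canonical order)
b = "ป\u0e48\u0e39"  # looks the same, but the tone mark was typed before the vowel
a == b  # False
unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)  # True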
Rules:
- Count user-visible characters by grapheme cluster: the `regex` library's `\X` (Python) or `Intl.Segmenter("th", {granularity: "grapheme"})` (JS).
- `len("ก่")` in Python returns 2 (codepoints), but the user sees 1 character.

See `examples/normalize.py`.
| System | Use case | Lossless? | Example: สวัสดี |
|---|---|---|---|
| RTGS (Royal Thai General System) | Signs, passports, SEO slugs | No (drops tone, vowel length) | sawatdi |
| ISO 11940 | Academic, archives | Yes (one-to-one, ugly) | s̄wạs̄dī |
| ISO 11940-2 | Phonetic, prettier | No | sawatdi |
| IPA | Phonetic transcription | Approximate | /sā.wàt.dīː/ |
Use RTGS for slugs and English-readable transliteration of names. PyThaiNLP exposes pythainlp.transliterate.romanize(text, engine="royin") for RTGS-style output. For deterministic batch slugs, also strip non-[a-z0-9-] after romanization.
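The slug clean-up step can be sketched with the standard library alone. This assumes the input is already romanized (for example, the output of `romanize(text, engine="royin")`); the helper name `slugify_romanized` is hypothetical.

```python
import re


def slugify_romanized(romanized: str) -> str:
    """Turn already-romanized text into a URL slug of [a-z0-9-] only."""
    slug = romanized.strip().lower()
    slug = re.sub(r"\s+", "-", slug)          # whitespace runs become one hyphen
    slug = re.sub(r"[^a-z0-9-]", "", slug)    # drop everything else
    slug = re.sub(r"-{2,}", "-", slug)        # collapse repeated hyphens
    return slug.strip("-")


print(slugify_romanized("Sawatdi Khrap!"))  # sawatdi-khrap
```

Because RTGS drops tone and vowel length, distinct Thai words can collide on the same slug; add a numeric suffix or ID where uniqueness matters.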
Thai does NOT sort by codepoint. The traditional rule: sort by base consonant first, then vowel, then tone mark. Codepoint sort puts leading vowels (เ, แ, โ, ใ, ไ) AFTER consonants, which is wrong.
| Place | How |
|---|---|
| PostgreSQL (ICU build) | name TEXT COLLATE "th-TH-x-icu" |
| MariaDB | COLLATE utf8mb4_thai_520_w2 (UCA 5.2.0, secondary-level weights) |
| Python (no DB) | pyicu Collator with Locale("th_TH"), or PyThaiNLP's pythainlp.util.collate |
| JavaScript | Intl.Collator("th", {sensitivity: "base"}) |
| Elasticsearch | icu_collation_keyword field with language: "th" |
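-- PostgreSQL example (ICU build)
-- Option 1: use the predefined ICU collation directly
SELECT name FROM users ORDER BY name COLLATE "th-TH-x-icu";
-- Option 2: define a named collation once, then reference it
CREATE COLLATION thai_icu (provider = icu, locale = 'th-TH');
SELECT name FROM users ORDER BY name COLLATE thai_icu;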
Default UTF-8 / binary sort is wrong — verify by sorting ["แอน","กิน","โต","ไกล"] and confirming Thai dictionary order, not codepoint order.
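The check above can be run without a database. The sketch below compares plain codepoint sorting with a simplified sort key that moves each leading vowel behind its consonant and compares tone marks last. The key function `thai_sort_key` is an illustrative approximation, not ICU's collation algorithm; prefer a real collator in production.

```python
import re

# A leading vowel (เ แ โ ใ ไ) is written before the consonant it sorts under.
_LEADING_VOWEL = re.compile(r"([\u0e40-\u0e44])([\u0e01-\u0e2e])")
# Marks U+0E47 to U+0E4C (tone marks and similar) are compared last.
_TONE_MARK = re.compile(r"[\u0e47-\u0e4c]")


def thai_sort_key(word: str) -> tuple:
    """Approximate dictionary order: consonant, then vowel, then tone mark."""
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    base = _TONE_MARK.sub("", swapped)
    tones = "".join(_TONE_MARK.findall(swapped))
    return (base, tones)


words = ["แอน", "กิน", "โต", "ไกล"]
print(sorted(words))                     # ['กิน', 'แอน', 'โต', 'ไกล']  codepoint order
print(sorted(words, key=thai_sort_key))  # ['กิน', 'ไกล', 'โต', 'แอน']  consonant-first order
```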
Truncating mid-character (codepoint vs grapheme) can leave dangling combining marks rendered as ◌. Rules:
- Truncate on grapheme boundaries. Python: `regex` library `\X`. JS: `Intl.Segmenter` with `granularity: "grapheme"`.
- When appending `…`, don't cut mid-syllable.
- `VARCHAR(50)` counts bytes in some collations and characters in others. Test before assuming "50 chars" matches user perception.

See `examples/normalize.py` for grapheme-safe truncation.
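For environments where the third-party `regex` library is not available, the following is a rough standard-library approximation of grapheme-safe truncation. It only keeps combining marks (Unicode categories Mn, Mc, Me) attached to their base character, so it is not a full implementation of `\X`; the name `truncate_graphemes` is hypothetical.

```python
import unicodedata


def truncate_graphemes(text: str, limit: int, ellipsis: str = "…") -> str:
    """Keep at most `limit` base characters, never splitting off their marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch      # combining mark: stays with its base
        else:
            clusters.append(ch)     # new user-visible character
    if len(clusters) <= limit:
        return text
    return "".join(clusters[:limit]) + ellipsis


word = "\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"  # "ที่นี่": 6 code points, 2 visible characters
print(len(word))                    # 6
print(truncate_graphemes(word, 1))  # first visible character plus the ellipsis
```

A plain slice such as `word[:2]` would silently drop the tone mark from the first syllable, which changes the word.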
Elasticsearch / OpenSearch:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "thai",
        "search_analyzer": "thai"
      }
    }
  }
}
PostgreSQL full-text: the bundled tsvector configs don't ship a Thai dictionary. Options:
- Segment in the application and store the tokens in a `tsvector` column with the `simple` config
- `pg_search` / `paradedb` with a Thai tokenizer

Don't rely on `ILIKE '%term%'`. Without segmentation it does a substring match: it works for very short queries, fails for multi-word queries, misses synonyms, and can't rank.
Modern LLMs (Claude, GPT, Gemini, Llama) use BPE / SentencePiece tokenizers trained primarily on English. Thai gets a rough ride:
- A split can leave `โ` (leading vowel) in one chunk and its consonant in the next; both render as `?` or `◌` in downstream UIs.

Practical advice:
- Segment with PyThaiNLP (`word_tokenize` or `Tokenizer(engine="newmm")`) before semantic search, RAG chunking, or embedding generation.
- `chunk_size=512` character splits will fragment compound nouns and tank retrieval.

TIS-620 is the pre-Unicode Thai character encoding, still found in legacy files and on older websites.
It is a single-byte encoding: bytes 0x00–0x7F are ASCII, bytes 0xA1–0xFB map to Thai characters. Microsoft's variant is Windows-874 / cp874 — TIS-620 plus a few extra symbols (curly quotes, euro sign, etc.).
Pitfalls:
- TIS-620 bytes decoded as Latin-1 show up as mojibake such as `àÅÂä·Â`.
- `0xA0` (non-breaking space) can be mistaken for a Thai consonant; strict TIS-620 decoders raise on it.
- Encoding auto-detection (`chardet`, `charset-normalizer`) is unreliable on short Thai strings; confirm by inspecting bytes, not by guessing.

Recipe (Python):
# Read a legacy file as TIS-620 or cp874 and write back as UTF-8
with open("legacy.csv", "rb") as f:
    raw = f.read()

# Try cp874 first (superset of TIS-620, handles MS exports too)
try:
    text = raw.decode("cp874")
except UnicodeDecodeError:
    # Fall back, replacing bytes that cp874 leaves undefined with U+FFFD
    text = raw.decode("cp874", errors="replace")

with open("clean.csv", "w", encoding="utf-8") as f:
    f.write(text)

# Recover double-encoded mojibake (TIS-620 bytes that were decoded as Latin-1)
def fix_double_encoded(s: str) -> str:
    return s.encode("latin-1").decode("cp874")
For Pandas: pd.read_csv("legacy.csv", encoding="cp874"). For requests / HTTP: check Content-Type: text/html; charset=tis-620 — old Thai gov sites still serve it.
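When the encoding of an incoming byte string is unknown, one way to avoid guessing is to try strict UTF-8 first and fall back to cp874. This is a sketch under the assumption that the input is either UTF-8 or TIS-620/cp874; it relies on the fact that Thai text in a single-byte encoding is rarely valid UTF-8. The helper name `decode_thai_bytes` is hypothetical.

```python
from typing import Tuple


def decode_thai_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8 over cp874."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Not valid UTF-8: treat as cp874, marking undefined bytes with U+FFFD.
    return raw.decode("cp874", errors="replace"), "cp874"


print(decode_thai_bytes(b"\xe4\xb7\xc2"))          # the word "ไทย" in TIS-620 bytes
print(decode_thai_bytes("ไทย".encode("utf-8")))    # the same word in UTF-8
```

Pure-ASCII input is reported as UTF-8, which is harmless because ASCII is identical in both encodings.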
- `.split(" ")` to count Thai words → always returns 1; use a segmenter
- `ORDER BY name` without collation → vowels and tone marks in codepoint order, not dictionary order
- `WHERE name = 'แสง'` misses rows that were saved with a different codepoint sequence; normalize on write and on query
- `LEFT(text, 20)` for "first 20 chars" → counts bytes or codepoints, not graphemes; renders ◌
- `.split()` doesn't split on the zero-width space (U+200B), which is sometimes injected by CMSs as a word-break hint; strip it on input or normalize
- `chardet` to identify TIS-620 vs UTF-8 on short strings → it guesses wrong on names and headlines; check the byte range manually or try cp874 decode first
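Two of these pitfalls (inconsistent codepoint sequences and stray zero-width spaces) can be handled at the input boundary. The sketch below is one possible cleaning step using only the standard library; the function name `clean_thai_input` is an assumption, and it deliberately does nothing beyond removing U+200B and applying NFC.

```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"


def clean_thai_input(text: str) -> str:
    """Remove zero-width spaces, then normalize to NFC."""
    # Strip U+200B first so it cannot sit between marks and block reordering.
    text = text.replace(ZERO_WIDTH_SPACE, "")
    return unicodedata.normalize("NFC", text)


stored = clean_thai_input("ป\u0e39\u0e48")         # vowel, then tone mark
query = clean_thai_input("ป\u0e48\u200b\u0e39")    # tone mark first, with a stray U+200B
print(stored == query)  # True
```

Apply the same function to both stored values and query strings; normalizing only one side still produces mismatches.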