Changelog
All notable changes to kham. Follows Keep a Changelog and Semantic Versioning.
Added — kham-core data
- NE gazetteer expanded — 36,668 → 38,950 entries (+2,280 from pythainlp/thainer-corpus-v2, CC0): +403 PERSON, +793 PLACE, +1,084 ORG.
- POS table expanded — 8,993 → 11,404 entries (+2,407 from UD_Thai-PUD, CC-BY-SA-3.0): +884 PROPN, +665 VERB, +662 NOUN, +134 ADJ, and more. UPOS mapped to kham 13-category scheme.
- wiki_freq.tsv — new supplemental word-frequency file (CC-BY-SA-4.0) from 500 Thai Wikipedia articles; 19,890 entries. Kept separate from CC0 tnc_freq.txt; not yet loaded by FreqMap.
Fixed
- Docker Hub image tags — corrected tags to use v-prefixed format (e.g. v0.8.1); the docs previously showed tags without the v prefix.
Added — kham-core
- Token::confidence: f32 — segmentation confidence score on every token; 0.0 for Unknown tokens, 1.0 for unambiguous dict matches; intermediate values reflect TNC frequency and boundary ambiguity. Propagated into FtsToken::confidence and all bindings.
- TokenStream / Tokenizer::segment_stream(text) — streaming iterator with next_word(), next_known(), and next_above_confidence(f32).
- SpellChecker::did_you_mean(word) — returns None if the word is in the dictionary, Some(best_match) otherwise.
- SpellChecker::correct_text(text) — segments the input and replaces every Unknown token (≥ 2 chars) with its best spelling correction; known tokens pass through unchanged.
- RomanizationMap::romanize_sentence(text) — segments and RTGS-romanizes every Thai/Named token; non-Thai tokens (numbers, Latin, punctuation, whitespace) pass through as-is.
- KeyExtractor::extract_phrases(text, max_n) — bigram and trigram keyphrases from adjacent content tokens, scored by TF × average-IDF; complements the existing extract() unigram extraction.
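The TF × average-IDF phrase score can be illustrated with a small standalone sketch. This is not kham's implementation: the function name score_bigrams, the caller-supplied idf table, and the default_idf fallback are all hypothetical, and only bigrams are handled.

```rust
use std::collections::HashMap;

/// Scores bigram phrases as TF x (mean IDF of the two words).
/// `tokens` are content tokens in document order. `idf` is a caller-supplied,
/// hypothetical table; words missing from it fall back to `default_idf`.
fn score_bigrams(
    tokens: &[&str],
    idf: &HashMap<&str, f64>,
    default_idf: f64,
) -> Vec<(String, f64)> {
    // Term frequency of each adjacent pair.
    let mut tf: HashMap<(&str, &str), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *tf.entry((pair[0], pair[1])).or_insert(0) += 1;
    }

    let lookup = |w: &str| idf.get(w).copied().unwrap_or(default_idf);

    let mut scored: Vec<(String, f64)> = tf
        .into_iter()
        .map(|((a, b), count)| {
            let avg_idf = (lookup(a) + lookup(b)) / 2.0;
            // Joined with a space for readability only.
            (format!("{a} {b}"), count as f64 * avg_idf)
        })
        .collect();

    // Highest score first; ties broken alphabetically for stable output.
    scored.sort_by(|x, y| y.1.total_cmp(&x.1).then_with(|| x.0.cmp(&y.0)));
    scored
}
```

A phrase that repeats and is built from rare words ranks highest; a trigram variant would use windows(3) and average three IDF values.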
Added — bindings (Python / WASM / C FFI)
- spell_did_you_mean / kham_spell_did_you_mean — single-word correction check exposed in all three bindings.
- spell_correct_text / kham_spell_correct_text — full-text spell correction exposed in all three bindings.
- romanize_sentence / kham_romanize_sentence — sentence-level RTGS romanization exposed in all three bindings.
- extract_phrases / kham_extract_phrases — keyphrase extraction exposed in all three bindings.
- segment_above_confidence(text, min_confidence) (Python / WASM) — convenience filter returning only tokens at or above the confidence threshold via segment_stream.
- confidence field on Token — exposed in Python, WASM, and C FFI (KhamToken) as a float / f32.
Added — kham-cli
- --confidence — append conf=X.XX per token in text output mode.
- --min-confidence <MIN> — filter output to tokens with confidence ≥ MIN via segment_stream.
- --format text|json|csv — structured output for both basic and FTS modes.
- --romanize — segment and romanize Thai text to RTGS Latin; non-Thai tokens pass through.
- --spell — spell-check mode: ranked suggestions for the input word using Levenshtein + phonetic re-ranking.
- --keywords — keyword mode: top keywords and keyphrases from the text with TF × IDF scores.
- --top-n <N> — max results in --spell or --keywords mode (default: 10).
Added — kham-pg
- Stopword suppression — Thai grammatical particles (กับ, ใน, ของ, …) are suppressed by kham_fts_dict and excluded from the tsvector, reducing index noise without any configuration.
- Thai number normalization — Thai digit strings (๑๒๓) are now indexed alongside their ASCII equivalent (123) as colocated lexemes; cross-script numeric queries like plainto_tsquery('kham', '123') match documents containing ๑๒๓ automatically.
- POS lexeme expansion — tokens with a known part of speech emit a colocated pos_<tag> lexeme (e.g. pos_noun, pos_verb); filter by POS with 'pos_verb'::tsquery.
- kham_fts_dict_udom83 and kham_fts_dict_metasound — two new dictionary variants backed by the udom83 and MetaSound soundex algorithms; swap dictionaries in custom FTS configurations for finer phonetic discrimination.
- kham_tsvector(text) and kham_tsquery(text) — SQL STABLE convenience helpers; shorthand for to_tsvector('kham', …) and plainto_tsquery('kham', …).
- StopwordSet::builtin_with_extra(extra) (kham-core) — combines the built-in 1,029-word stopword list with caller-supplied domain words in one call.
- FtsTokenizer::segment_stream(text) (kham-core) — returns an FtsTokenStream iterator with next_index_token() to advance past stopwords automatically.
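The idea behind Thai number normalization can be sketched in a few lines. The helpers below are hypothetical illustrations, not the extension's code: normalize_thai_digits maps the Thai digit block (U+0E50 to U+0E59) to ASCII, and number_lexemes returns both forms so that either script matches.

```rust
/// Maps Thai digits (U+0E50..=U+0E59) to ASCII '0'..='9'.
/// All other characters pass through unchanged.
fn normalize_thai_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0E50}'..='\u{0E59}' => {
                let offset = c as u32 - 0x0E50;
                char::from_digit(offset, 10).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}

/// Lexemes to index for one number token: the original form and,
/// when it differs, the ASCII form as a colocated alternative.
fn number_lexemes(token: &str) -> Vec<String> {
    let ascii = normalize_thai_digits(token);
    if ascii == token {
        vec![ascii]
    } else {
        vec![token.to_string(), ascii]
    }
}
```

Indexing both forms at the same position is what lets a query for 123 find a document that only contains ๑๒๓.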
Changed — kham-pg
- number token mapping — the built-in kham configuration now routes number tokens through kham_fts_dict instead of kham_dict, enabling Thai digit normalization. Use ALTER EXTENSION kham_pg UPDATE to apply the change to existing databases.
Added — kham-core
- SpellChecker — SpellChecker::builtin().suggestions(word, n) returns ranked candidates (Levenshtein ≤ 2, lk82 soundex match, TNC frequency score).
- KeyExtractor — KeyExtractor::builtin().extract(text, n) returns top-N keywords by TF × IDF-proxy score with stopword suppression.
- FtsTokenizerBuilder::dict_merge() — overlays extra words on the built-in dictionary when constructing an FtsTokenizer (fast overlay path, no trie rebuild).
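The Levenshtein ≤ 2 candidate filter can be sketched as follows. This covers only the edit-distance step; the soundex match and frequency score are not shown. The names levenshtein and candidates_within are hypothetical, and the sketch assumes distance is counted over Unicode scalar values, which may differ from what kham does.

```rust
/// Levenshtein edit distance over Unicode scalar values (not bytes),
/// so multi-byte Thai characters count as one edit each.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Dictionary words within `max_distance` edits of `word`,
/// nearest first, ties broken alphabetically.
fn candidates_within<'a>(
    word: &str,
    dictionary: &[&'a str],
    max_distance: usize,
) -> Vec<(&'a str, usize)> {
    let mut hits: Vec<(&'a str, usize)> = dictionary
        .iter()
        .map(|&w| (w, levenshtein(word, w)))
        .filter(|&(_, d)| d <= max_distance)
        .collect();
    hits.sort_by(|x, y| x.1.cmp(&y.1).then_with(|| x.0.cmp(y.0)));
    hits
}
```

A real checker would avoid scanning the whole dictionary, for example by bounding candidates by length difference first.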
Added — bindings (WASM / Python / C FFI)
- spell_suggestions / kham_spell_suggestions — exposed in kham-wasm, kham-python, and kham-capi with rich result types (SpellSuggestion / KhamSpellList).
- extract_keywords / kham_keywords — exposed in all three bindings with Keyword / KhamKeywordList result types.
- Live demo — Spell & Keywords tabs — interactive browser demo for spell checking and keyword extraction powered by WASM.
Added — kham-sqlite
- Custom synonym map — synonyms '<path>' tokenize argument loads a TSV synonym file at table-creation time. Synonyms are emitted as FTS5_TOKEN_COLOCATED so queries match canonical and synonym forms.
- Custom dictionary overlay — dict '<path>' tokenize argument overlays domain-specific words on the built-in dictionary without a full trie rebuild.
- Integration test suite — 31 tests covering basic MATCH, RTGS romanization, lk82 soundex, snippet() / highlight(), stopword filtering, mixed script, NE recognition, all config options, custom synonyms and dict.
- Windows build support — build.rs now detects Windows and resolves SQLite headers via vcpkg (VCPKG_ROOT) or SQLITE_INCLUDE_DIR override. Pre-built kham_sqlite.dll shipped in release assets.
- Android NDK build — CI release workflow cross-compiles libkham_sqlite.so for all 4 Android ABIs (arm64-v8a, armeabi-v7a, x86_64, x86) via Android NDK. Requires sqlite-android or SQLCipher (system SQLite has load_extension disabled).
Fixed — kham-sqlite
- Trigrams not emitted — FtsToken::trigrams for Unknown tokens was populated by the FTS pipeline but never forwarded to SQLite as colocated tokens. OOV n-gram search now works correctly.
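For context, trigram generation for an out-of-vocabulary token can be sketched like this. The function char_trigrams is hypothetical, and it assumes trigrams are taken over Unicode scalar values; kham may instead use character clusters or another unit.

```rust
/// Overlapping character trigrams of a token, computed over Unicode
/// scalar values. Tokens shorter than three characters yield nothing.
fn char_trigrams(token: &str) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Emitting each trigram at the same position as the unknown token is what allows a partial query to match a word the dictionary does not know.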
Fixed
- WASM number overflow — corrected u32 overflow for large numbers in the kham-wasm binding.
- NE tag correction — ประเทศไทย re-tagged from PERSON → PLACE.
Added — kham-pg
- kham_fts_dict — custom dictionary template expanding each Thai/Named token to up to 6 lexemes: normalised word, lk82 Thai Soundex code, and RTGS romanization. Enables phonetic-fuzzy and Latin-script romanization search without schema changes.
- ts_headline support — kham_headline callback registered as HEADLINE function. Marks matching tokens with configurable StartSel/StopSel. 5 regress tests added.
- Named entity token type — TokenKind::Named(_) → 7 (named) registered in kham_lextypes.
Fixed — kham-pg
- PG 16+ lexize calling convention — arg3 in kham_dict_lexize is a List* on PG 16+, not a bool. Previously caused every token to be silently discarded as a stopword, producing empty tsvectors for all Thai text.
Added
- Thai phonetic encoding — soundex module: lk82, udom83, metasound, thai_english_soundex. Unified soundex(word, SoundexAlgorithm) dispatch.
- CLI --soundex flag — phonetic code appears in syn= FTS output field.
- Accuracy benchmark — kham-bench-accuracy binary: word-boundary P/R/F1 against testdata, --threshold CI gate.
- PyThaiNLP comparison script — scripts/compare_pythainlp.py with --export-testdata and --agreed modes.
- Data expansions — NE gazetteer +17,240 Wikipedia entries and +8,980 Thai family names → 36,600 total. POS table +8,691 ORCHID entries → ~9,000 total. TNC frequency +2,410 entries.
- Build size — tnc_freq.txt, ne_th.tsv, pos_th.tsv zlib-compressed at compile time via build.rs.
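Word-boundary precision, recall, and F1 can be computed as in the sketch below. It is an illustration under stated assumptions, not the benchmark's code: boundaries and boundary_prf are hypothetical names, offsets are counted in characters, and the end-of-text boundary is included in both sets.

```rust
use std::collections::HashSet;

/// End offsets (in chars) of every token in a segmentation.
fn boundaries(tokens: &[&str]) -> HashSet<usize> {
    let mut pos = 0;
    tokens
        .iter()
        .map(|t| {
            pos += t.chars().count();
            pos
        })
        .collect()
}

/// (precision, recall, F1) of predicted boundaries against gold ones.
/// Both segmentations are assumed to cover the same text.
fn boundary_prf(gold: &[&str], predicted: &[&str]) -> (f64, f64, f64) {
    let g = boundaries(gold);
    let p = boundaries(predicted);
    let tp = g.intersection(&p).count() as f64;
    let precision = if p.is_empty() { 0.0 } else { tp / p.len() as f64 };
    let recall = if g.is_empty() { 0.0 } else { tp / g.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

A CI gate in the spirit of --threshold would compare the returned F1 with a minimum and fail the run when it is lower.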
Changed — breaking
- Compound-first DP scoring — DpScore field order changed: minimising token count is now priority 2 (above dict-word maximisation). Fixes systematic over-segmentation. Micro F1 improved from 0.418 to 0.975; sentence agreement with PyThaiNLP newmm rose from 1/39 to 37/39 (94.9%).
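Why field order matters can be shown with a derived lexicographic ordering. The struct below is a hypothetical stand-in, not kham's DpScore: the changelog only states that token count is priority 2 and sits above dict-word maximisation, so the first field (fewest unknown characters) is an assumption made for illustration.

```rust
use std::cmp::Reverse;

/// Hypothetical path score. Derived `Ord` compares fields from top to
/// bottom, so the declaration order is the priority order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PathScore {
    /// Priority 1 (assumed for this sketch): fewer unknown chars is better.
    unknown_chars: Reverse<u32>,
    /// Priority 2: fewer tokens is better.
    token_count: Reverse<u32>,
    /// Priority 3: more dictionary words is better.
    dict_words: u32,
}

/// Picks the better of two candidate segmentations of the same text.
fn better(a: PathScore, b: PathScore) -> PathScore {
    if b > a { b } else { a }
}
```

With token count ahead of dictionary-word count, one compound token beats the same span split into three dictionary words, which is the over-segmentation case described above.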
Added
- abbrev module — AbbrevMap with 118-entry built-in TSV (months, era markers, ranks, agencies, Bangkok districts). Greedy longest-first pre-tokenisation expansion.
- date module — Thai date normalization: 7 input formats, Buddhist Era + Gregorian, ISO 8601 output.
- sentence module — Thai sentence segmentation: Thai terminators, Paiyannoi, universal punctuation, decimal/abbreviation-aware dot rules.
- FTS pipeline — FtsTokenizerBuilder::abbrevs() opt-in abbreviation expansion.
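The Buddhist Era to ISO 8601 step of date normalization can be sketched as below. The function be_date_to_iso is hypothetical and handles only already-parsed numeric fields. It relies on the fixed 543-year offset between Buddhist Era and Common Era year numbers, which holds for modern dates.

```rust
/// Offset between Buddhist Era and Common Era year numbers.
const BE_OFFSET: i32 = 543;

/// Converts a day, month, and Buddhist Era year to an ISO 8601 date.
/// Returns None for out-of-range fields.
fn be_date_to_iso(day: u32, month: u32, be_year: i32) -> Option<String> {
    if !(1..=12).contains(&month) {
        return None;
    }
    let year = be_year - BE_OFFSET;
    if year < 1 {
        return None;
    }
    // Gregorian leap-year rule.
    let leap = year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    let days_in_month = match month {
        4 | 6 | 9 | 11 => 30,
        2 if leap => 29,
        2 => 28,
        _ => 31,
    };
    if day == 0 || day > days_in_month {
        return None;
    }
    Some(format!("{year:04}-{month:02}-{day:02}"))
}
```

Parsing the seven textual input formats (month names, era markers, and so on) would sit in front of this step and is not shown.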
Added
- Named Entity Recognition — NeTagger: gazetteer-based, greedy longest-match, up to 5 consecutive tokens. Built-in NE gazetteer: 10,488 entries. TokenKind::Named(NamedEntityKind) — Person / Place / Org.
- POS Tagging — PosTagger: lookup-based, pos_th.tsv with 338 entries, 13 ORCHID-derived categories (NOUN VERB ADJ ADV PART PROPN PRON NUM CLAS CONJ AUX DET PREP).
- RTGS Romanization — RomanizationMap: 415-entry table-driven Thai → Roman mapping. Opt-in via FTS pipeline builder.
- Number normalization — thai_digits_to_ascii, parse_thai_word, u64_to_thai_word, parse_thai_baht.
- SQLite FTS5 extension — kham-sqlite loadable extension: full NLP pipeline, byte-accurate offsets for highlight() and snippet().
- FTS pipeline — FtsTokenizer builder wiring POS, NE, romanization, stopwords, synonym expansion in a single pass.
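Greedy longest-match over a gazetteer, limited to 5 consecutive tokens, can be sketched as follows. This is not the NeTagger implementation: merge_entities and MAX_SPAN are hypothetical names, the gazetteer is a plain set, and the sketch assumes entries match the token surfaces concatenated without separators.

```rust
use std::collections::HashSet;

/// Maximum number of consecutive tokens merged into one entity.
const MAX_SPAN: usize = 5;

/// Merges runs of tokens whose concatenation appears in `gazetteer`.
/// Returns (surface, is_entity) pairs in document order.
fn merge_entities(tokens: &[&str], gazetteer: &HashSet<&str>) -> Vec<(String, bool)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let longest = MAX_SPAN.min(tokens.len() - i);
        // Try the widest window first and shrink until an entry matches.
        let hit = (1..=longest).rev().find_map(|n| {
            let joined = tokens[i..i + n].concat();
            gazetteer.contains(joined.as_str()).then_some((joined, n))
        });
        match hit {
            Some((entity, n)) => {
                out.push((entity, true));
                i += n;
            }
            None => {
                out.push((tokens[i].to_string(), false));
                i += 1;
            }
        }
    }
    out
}
```

Trying the widest window first means a longer entry wins over a shorter one that shares its prefix; a full tagger would also attach the entity kind (Person, Place, Org) to each match.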
v0.1.0 – v0.1.3 2026-04-18 – 2026-04-19
v0.1.3 — PostgreSQL pg_regress test suite: 67 tests across 4 suites (kham_fts, kham_thai, kham_operators, kham_ranking).
v0.1.2 — PostgreSQL FTS extension (kham-pg): Thai text search parser for PG 17, 6 token types, make install + Docker regress. FTS modules: stopwords, synonyms, ngrams, FtsTokenizer. C FFI kham_fts_lexemes().
v0.1.1 — Dual MIT OR Apache-2.0 licensing. scripts/deploy.sh. pyproject.toml for maturin builds.
v0.1.0 — Initial release: DAG-based newmm segmentation, DARTS dictionary, TCC boundary detection, Unicode pre-tokenizer, zero-copy Token, Python (PyO3) and WASM (wasm-bindgen) bindings, C FFI (cbindgen), CLI, criterion benchmarks.