Changelog
All notable changes to kham. Follows Keep a Changelog and Semantic Versioning.
Added — kham-pg
- kham_fts_dict — custom dictionary template expanding each Thai/Named token to up to 6 lexemes: normalised word, lk82 Thai Soundex code, and RTGS romanization. Enables phonetic-fuzzy and Latin-script romanization search without schema changes.
- ts_headline support — kham_headline callback registered as HEADLINE function. Marks matching tokens with configurable StartSel/StopSel. 5 regress tests added.
- Named entity token type — TokenKind::Named(_) → 7 (named) registered in kham_lextypes.
Fixed — kham-pg
- PG 16+ lexize calling convention — arg3 in kham_dict_lexize is a List* on PG 16+, not a bool. Previously caused every token to be silently discarded as a stopword, producing empty tsvectors for all Thai text.
Added
- Thai phonetic encoding — soundex module: lk82, udom83, metasound, thai_english_soundex. Unified soundex(word, SoundexAlgorithm) dispatch.
- CLI --soundex flag — phonetic code appears in syn= FTS output field.
- Accuracy benchmark — kham-bench-accuracy binary: word-boundary P/R/F1 against testdata, --threshold CI gate.
- PyThaiNLP comparison script — scripts/compare_pythainlp.py with --export-testdata and --agreed modes.
- Data expansions — NE gazetteer +17,240 Wikipedia entries and +8,980 Thai family names → 36,600 total. POS table +8,691 ORCHID entries → ~9,000 total. TNC frequency +2,410 entries.
- Build size — tnc_freq.txt, ne_th.tsv, pos_th.tsv zlib-compressed at compile time via build.rs.
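For readers unfamiliar with word-boundary scoring, the metric reported by the accuracy benchmark can be sketched roughly as follows. This is an illustration only, not the kham-bench-accuracy source: the function names are hypothetical, and it assumes boundaries are counted as cumulative token end offsets in characters, including the final one.

```rust
use std::collections::BTreeSet;

/// Cumulative end offsets (in chars) of each token, i.e. the word boundaries.
/// Hypothetical helper; the real benchmark may count boundaries differently.
fn boundaries(tokens: &[&str]) -> BTreeSet<usize> {
    let mut pos = 0;
    tokens
        .iter()
        .map(|t| {
            pos += t.chars().count();
            pos
        })
        .collect()
}

/// Precision, recall and F1 of predicted boundaries against gold boundaries.
fn boundary_prf(gold: &[&str], predicted: &[&str]) -> (f64, f64, f64) {
    let g = boundaries(gold);
    let p = boundaries(predicted);
    // A hit is a boundary offset present in both segmentations.
    let hits = g.intersection(&p).count() as f64;
    let precision = if p.is_empty() { 0.0 } else { hits / p.len() as f64 };
    let recall = if g.is_empty() { 0.0 } else { hits / g.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

A CI gate in the spirit of --threshold would then fail the build when the returned F1 drops below a chosen value.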
Changed — breaking
- Compound-first DP scoring — DpScore field order changed: minimising token count is now priority 2 (above dict-word maximisation). Fixes systematic over-segmentation. Micro F1 improved from 0.418 → 0.975, sentence agreement vs PyThaiNLP newmm: 1/39 → 37/39 (94.9%).
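The effect of reordering score fields can be pictured as a lexicographic comparison in which earlier fields dominate later ones. The sketch below is illustrative only: the field names and the priority-1 criterion are assumptions, not kham's actual DpScore definition. Only the relative order of token count and dict-word count is taken from the entry above.

```rust
use std::cmp::Reverse;

/// Hypothetical score for one candidate segmentation; smaller is better.
/// Derived `Ord` compares fields top to bottom, so field order is priority order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct DpScore {
    /// Priority 1 (assumed): characters not covered by any dictionary word.
    unknown_chars: usize,
    /// Priority 2: number of tokens; fewer tokens favours compounds.
    token_count: usize,
    /// Priority 3: dictionary words; `Reverse` makes a higher count sort first.
    dict_words: Reverse<usize>,
}

/// Pick the best candidate under the lexicographic ordering.
fn best(candidates: &[DpScore]) -> Option<DpScore> {
    candidates.iter().copied().min()
}
```

With this ordering, a single compound token beats three shorter dictionary words covering the same text, because token count is compared before dict-word count. Swapping the last two fields would reverse that outcome, which is why a field-order change is a breaking behavioural change.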
Added
- abbrev module — AbbrevMap with 118-entry built-in TSV (months, era markers, ranks, agencies, Bangkok districts). Greedy longest-first pre-tokenisation expansion.
- date module — Thai date normalization: 7 input formats, Buddhist Era + Gregorian, ISO 8601 output.
- sentence module — Thai sentence segmentation: Thai terminators, Paiyannoi, universal punctuation, decimal/abbreviation-aware dot rules.
- FTS pipeline — FtsTokenizerBuilder::abbrevs() opt-in abbreviation expansion.
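Greedy longest-first expansion, as named in the abbrev entry, can be sketched as a left-to-right scan that tries the longest abbreviation first at each position. This is a minimal illustration under assumptions: `expand_abbrevs` and its slice-of-pairs table are hypothetical and do not reflect the AbbrevMap API or the built-in TSV.

```rust
/// Expand abbreviations by scanning left to right and, at each position,
/// trying the longest abbreviation first. Hypothetical helper, not kham's API.
fn expand_abbrevs(text: &str, table: &[(&str, &str)]) -> String {
    // Sort by descending key length so longer keys win over their own prefixes.
    let mut entries: Vec<(&str, &str)> = table.to_vec();
    entries.sort_by(|a, b| b.0.len().cmp(&a.0.len()));

    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(first) = rest.chars().next() {
        let hit = entries
            .iter()
            .find(|(abbr, _)| !abbr.is_empty() && rest.starts_with(*abbr));
        match hit {
            Some((abbr, full)) => {
                // Emit the expansion and skip past the matched abbreviation.
                out.push_str(full);
                rest = &rest[abbr.len()..];
            }
            None => {
                // No abbreviation starts here; copy one character and move on.
                out.push(first);
                rest = &rest[first.len_utf8()..];
            }
        }
    }
    out
}
```

Trying the longest key first matters when one abbreviation is a prefix of another: with both พ. (Wednesday) and พ.ศ. (Buddhist Era) in the table, the input พ.ศ. must expand as the era marker rather than as the weekday followed by stray characters.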
Added
- Named Entity Recognition — NeTagger: gazetteer-based, greedy longest-match, up to 5 consecutive tokens. Built-in NE gazetteer: 10,488 entries. TokenKind::Named(NamedEntityKind) — Person / Place / Org.
- POS Tagging — PosTagger: lookup-based, pos_th.tsv with 338 entries, 13 ORCHID-derived categories (NOUN VERB ADJ ADV PART PROPN PRON NUM CLAS CONJ AUX DET PREP).
- RTGS Romanization — RomanizationMap: 415-entry table-driven Thai → Roman mapping. Opt-in via FTS pipeline builder.
- Number normalization — thai_digits_to_ascii, parse_thai_word, u64_to_thai_word, parse_thai_baht.
- SQLite FTS5 extension — kham-sqlite loadable extension: full NLP pipeline, byte-accurate offsets for highlight() and snippet().
- FTS pipeline — FtsTokenizer builder wiring POS, NE, romanization, stopwords, synonym expansion in a single pass.
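As an illustration of the digit step in number normalization, the sketch below maps Thai digits (U+0E50 to U+0E59) to ASCII and leaves everything else untouched. The function name `thai_digits_to_ascii_sketch` and its signature are assumptions for this example; the real thai_digits_to_ascii may differ in signature and behaviour.

```rust
/// Replace Thai digits (U+0E50..=U+0E59) with ASCII '0'..='9'.
/// All other characters pass through unchanged.
/// Hypothetical stand-in, not kham's actual function.
fn thai_digits_to_ascii_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0E50}'..='\u{0E59}' => {
                // Offset from Thai digit zero gives the numeric value 0..=9.
                let value = c as u32 - 0x0E50;
                char::from_digit(value, 10).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}
```

Normalizing digits before indexing lets a query typed with ASCII digits match text written with Thai digits, and the reverse.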
v0.1.0 – v0.1.3 2026-04-18 – 2026-04-19
v0.1.3 — PostgreSQL pg_regress test suite: 67 tests across 4 suites (kham_fts, kham_thai, kham_operators, kham_ranking).
v0.1.2 — PostgreSQL FTS extension (kham-pg): Thai text search parser for PG 17, 6 token types, make install + Docker regress. FTS modules: stopwords, synonyms, ngrams, FtsTokenizer. C FFI kham_fts_lexemes().
v0.1.1 — Dual MIT OR Apache-2.0 licensing. scripts/deploy.sh. pyproject.toml for maturin builds.
v0.1.0 — Initial release: DAG-based newmm segmentation, DARTS dictionary, TCC boundary detection, Unicode pre-tokenizer, zero-copy Token, Python (PyO3) and WASM (wasm-bindgen) bindings, C FFI (cbindgen), CLI, criterion benchmarks.