Changelog

All notable changes to kham are documented here. The format follows Keep a Changelog, and versioning follows Semantic Versioning.

v0.8.2 2026-05-03 Release ↗

Added — kham-core data

NE gazetteer expanded — 36,668 → 38,950 entries (+2,280 from pythainlp/thainer-corpus-v2, CC0): +403 PERSON, +793 PLACE, +1,084 ORG.
POS table expanded — 8,993 → 11,404 entries (+2,407 from UD_Thai-PUD, CC-BY-SA-3.0): +884 PROPN, +665 VERB, +662 NOUN, +134 ADJ, and more. UPOS mapped to kham 13-category scheme.
wiki_freq.tsv — new supplemental word-frequency file (CC-BY-SA-4.0) from 500 Thai Wikipedia articles; 19,890 entries. Kept separate from CC0 tnc_freq.txt; not yet loaded by FreqMap.

v0.8.1 2026-05-03 Release ↗

Fixed

Docker Hub image tags — corrected tags to use v-prefixed format (e.g. v0.8.1); the docs previously showed tags without the v prefix.

v0.8.0 2026-05-02 Release ↗

Added — kham-core

Token::confidence: f32 — segmentation confidence score on every token; 0.0 for Unknown tokens, 1.0 for unambiguous dict matches; intermediate values reflect TNC frequency and boundary ambiguity. Propagated into FtsToken::confidence and all bindings.
TokenStream / Tokenizer::segment_stream(text) — streaming iterator with next_word(), next_known(), and next_above_confidence(f32).
SpellChecker::did_you_mean(word) — returns None if the word is in the dictionary, Some(best_match) otherwise.
SpellChecker::correct_text(text) — segments the input and replaces every Unknown token (≥ 2 chars) with its best spelling correction; known tokens pass through unchanged.
RomanizationMap::romanize_sentence(text) — segments and RTGS-romanizes every Thai/Named token; non-Thai tokens (numbers, Latin, punctuation, whitespace) pass through as-is.
KeyExtractor::extract_phrases(text, max_n) — bigram and trigram keyphrases from adjacent content tokens, scored by TF × average-IDF; complements the existing extract() unigram extraction.
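
The TF × average-IDF scoring mentioned for extract_phrases can be illustrated with a minimal sketch. This is not kham's implementation: score_bigrams, the idf map, and the default_idf fallback are hypothetical names chosen for illustration, only bigrams are covered, and phrases are joined with a space purely for readability.

```rust
use std::collections::HashMap;

/// Score bigram keyphrases by TF x average-IDF.
/// `tokens` are content tokens in document order; `idf` maps a word to its IDF weight.
/// Words missing from `idf` fall back to `default_idf` (an assumption of this sketch).
fn score_bigrams(
    tokens: &[&str],
    idf: &HashMap<&str, f64>,
    default_idf: f64,
) -> Vec<(String, f64)> {
    // Term frequency of each adjacent token pair.
    let mut tf: HashMap<(&str, &str), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *tf.entry((pair[0], pair[1])).or_insert(0) += 1;
    }

    let mut scored: Vec<(String, f64)> = tf
        .into_iter()
        .map(|((a, b), count)| {
            let idf_a = idf.get(a).copied().unwrap_or(default_idf);
            let idf_b = idf.get(b).copied().unwrap_or(default_idf);
            let avg_idf = (idf_a + idf_b) / 2.0;
            (format!("{a} {b}"), count as f64 * avg_idf)
        })
        .collect();

    // Highest score first; ties broken alphabetically for a stable order.
    scored.sort_by(|x, y| y.1.partial_cmp(&x.1).unwrap().then_with(|| x.0.cmp(&y.0)));
    scored
}
```

Averaging the IDF of the member words keeps a phrase made of one rare and one common word from outranking a phrase made of two rare words at the same frequency.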

Added — bindings (Python / WASM / C FFI)

spell_did_you_mean / kham_spell_did_you_mean — single-word correction check exposed in all three bindings.
spell_correct_text / kham_spell_correct_text — full-text spell correction exposed in all three bindings.
romanize_sentence / kham_romanize_sentence — sentence-level RTGS romanization exposed in all three bindings.
extract_phrases / kham_extract_phrases — keyphrase extraction exposed in all three bindings.
segment_above_confidence(text, min_confidence) (Python / WASM) — convenience filter returning only tokens at or above the confidence threshold via segment_stream.
confidence field on Token — exposed in Python, WASM, and C FFI (KhamToken) as a float / f32.
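
The "at or above the confidence threshold" behaviour described for segment_above_confidence amounts to a simple filter. The sketch below uses a hypothetical ScoredToken type rather than kham's Token, so it shows only the comparison, not the real API.

```rust
/// Hypothetical stand-in for a token that carries a confidence score.
#[derive(Debug, Clone, PartialEq)]
struct ScoredToken {
    text: String,
    confidence: f32,
}

/// Keep only tokens whose confidence is at or above `min_confidence`.
/// The comparison is inclusive, matching the "at or above" wording.
fn above_confidence(tokens: &[ScoredToken], min_confidence: f32) -> Vec<ScoredToken> {
    tokens
        .iter()
        .filter(|t| t.confidence >= min_confidence)
        .cloned()
        .collect()
}
```

With a threshold of 0.0 every token passes, including Unknown tokens, since those are documented as scoring exactly 0.0.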

Added — kham-cli

--confidence — append conf=X.XX per token in text output mode.
--min-confidence <MIN> — filter output to tokens with confidence ≥ MIN via segment_stream.
--format text|json|csv — structured output for both basic and FTS modes.
--romanize — segment and romanize Thai text to RTGS Latin; non-Thai tokens pass through.
--spell — spell-check mode: ranked suggestions for the input word using Levenshtein + phonetic re-ranking.
--keywords — keyword mode: top keywords and keyphrases from the text with TF × IDF scores.
--top-n <N> — max results in --spell or --keywords mode (default: 10).

v0.7.0 2026-05-02 Release ↗

Added — kham-pg

Stopword suppression — Thai grammatical particles (กับ, ใน, ของ, …) are suppressed by kham_fts_dict and excluded from the tsvector, reducing index noise without any configuration.
Thai number normalization — Thai digit strings (๑๒๓) are now indexed alongside their ASCII equivalent (123) as colocated lexemes; cross-script numeric queries like plainto_tsquery('kham', '123') match documents containing ๑๒๓ automatically.
POS lexeme expansion — tokens with a known part of speech emit a colocated pos_<tag> lexeme (e.g. pos_noun, pos_verb); filter by POS with 'pos_verb'::tsquery.
kham_fts_dict_udom83 and kham_fts_dict_metasound — two new dictionary variants backed by the udom83 and MetaSound soundex algorithms; swap dictionaries in custom FTS configurations for finer phonetic discrimination.
kham_tsvector(text) and kham_tsquery(text) — SQL STABLE convenience helpers; shorthand for to_tsvector('kham', …) and plainto_tsquery('kham', …).
StopwordSet::builtin_with_extra(extra) (kham-core) — combines the built-in 1,029-word stopword list with caller-supplied domain words in one call.
FtsTokenizer::segment_stream(text) (kham-core) — returns an FtsTokenStream iterator with next_index_token() to advance past stopwords automatically.
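
The Thai number normalization entry above relies on mapping Thai digits to ASCII and indexing both forms. A minimal sketch of that idea follows; to_ascii_digits and number_lexemes are hypothetical helper names, not functions from kham-pg.

```rust
/// Map Thai digits (U+0E50..=U+0E59) to ASCII digits; all other characters pass through.
fn to_ascii_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0E50}'..='\u{0E59}' => char::from_digit(c as u32 - 0x0E50, 10).unwrap(),
            other => other,
        })
        .collect()
}

/// Lexemes to index for a number token: the original form,
/// plus the ASCII form when it differs (the colocated lexeme).
fn number_lexemes(token: &str) -> Vec<String> {
    let ascii = to_ascii_digits(token);
    if ascii == token {
        vec![ascii]
    } else {
        vec![token.to_string(), ascii]
    }
}
```

Emitting both forms at the same position is what lets a query for 123 match a document that contains ๑๒๓.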

Changed — kham-pg

number token mapping — the built-in kham configuration now routes number tokens through kham_fts_dict instead of kham_dict, enabling Thai digit normalization. Use ALTER EXTENSION kham_pg UPDATE to apply the change to existing databases.

v0.6.0 2026-05-01 Release ↗

Added — kham-core

SpellChecker — SpellChecker::builtin().suggestions(word, n) returns ranked candidates (Levenshtein ≤ 2, lk82 soundex match, TNC frequency score).
KeyExtractor — KeyExtractor::builtin().extract(text, n) returns top-N keywords by TF × IDF-proxy score with stopword suppression.
FtsTokenizerBuilder::dict_merge() — overlays extra words on the built-in dictionary when constructing an FtsTokenizer (fast overlay path, no trie rebuild).
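
The SpellChecker entry above combines an edit-distance cutoff with frequency ranking. The sketch below illustrates only those two parts; it omits the lk82 soundex match, and levenshtein and suggest are illustrative functions over a hypothetical (word, frequency) list, not kham's API.

```rust
/// Levenshtein edit distance over Unicode scalar values (two-row dynamic programming).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // Substitution, deletion, insertion.
            curr[j + 1] = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Candidates within `max_distance` of `word`: closest first, then most frequent first.
fn suggest<'a>(
    word: &str,
    dict: &[(&'a str, u32)],
    max_distance: usize,
    n: usize,
) -> Vec<&'a str> {
    let mut hits: Vec<(usize, u32, &'a str)> = dict
        .iter()
        .map(|&(w, freq)| (levenshtein(word, w), freq, w))
        .filter(|&(d, _, _)| d <= max_distance)
        .collect();
    hits.sort_by(|x, y| x.0.cmp(&y.0).then(y.1.cmp(&x.1)));
    hits.into_iter().take(n).map(|(_, _, w)| w).collect()
}
```

Computing the distance over chars rather than bytes matters for Thai, where each character occupies three bytes in UTF-8.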

Added — bindings (WASM / Python / C FFI)

spell_suggestions / kham_spell_suggestions — exposed in kham-wasm, kham-python, and kham-capi with rich result types (SpellSuggestion / KhamSpellList).
extract_keywords / kham_keywords — exposed in all three bindings with Keyword / KhamKeywordList result types.
Live demo — Spell & Keywords tabs — interactive browser demo for spell checking and keyword extraction powered by WASM.

Added — kham-sqlite

Custom synonym map — synonyms '<path>' tokenize argument loads a TSV synonym file at table-creation time. Synonyms are emitted as FTS5_TOKEN_COLOCATED so queries match canonical and synonym forms.
Custom dictionary overlay — dict '<path>' tokenize argument overlays domain-specific words on the built-in dictionary without a full trie rebuild.
Integration test suite — 31 tests covering basic MATCH, RTGS romanization, lk82 soundex, snippet()/highlight(), stopword filtering, mixed script, NE recognition, all config options, custom synonyms and dict.
Windows build support — build.rs now detects Windows and resolves SQLite headers via vcpkg (VCPKG_ROOT) or SQLITE_INCLUDE_DIR override. Pre-built kham_sqlite.dll shipped in release assets.
Android NDK build — CI release workflow cross-compiles libkham_sqlite.so for all 4 Android ABIs (arm64-v8a, armeabi-v7a, x86_64, x86) via Android NDK. Requires sqlite-android or SQLCipher (system SQLite has load_extension disabled).

Fixed — kham-sqlite

Trigrams not emitted — FtsToken::trigrams for Unknown tokens was populated by the FTS pipeline but never forwarded to SQLite as colocated tokens. OOV n-gram search now works correctly.
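
For readers unfamiliar with n-gram fallback for out-of-vocabulary words, the idea behind the fix above can be sketched as follows. This assumes trigrams are taken over Unicode characters; kham may well split on a different unit, and char_trigrams is a hypothetical helper.

```rust
/// Character trigrams of `word`, computed over Unicode scalar values so that
/// multi-byte Thai characters are never split. Words shorter than three
/// characters yield no trigrams.
fn char_trigrams(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Indexing these fragments alongside the unknown token lets a partial query still hit a word the dictionary has never seen.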

v0.5.1 2026-04-27 Release ↗

Fixed

WASM number overflow — corrected u32 overflow for large numbers in kham-wasm binding.
NE tag correction — ประเทศไทย re-tagged from PERSON → PLACE.

v0.5.0 2026-04-26 Release ↗

Added — kham-pg

kham_fts_dict — custom dictionary template expanding each Thai/Named token to up to 6 lexemes: normalised word, lk82 Thai Soundex code, and RTGS romanization. Enables phonetic-fuzzy and Latin-script romanization search without schema changes.
ts_headline support — kham_headline callback registered as HEADLINE function. Marks matching tokens with configurable StartSel/StopSel. 5 regress tests added.
Named entity token type — TokenKind::Named(_) → 7 (named) registered in kham_lextypes.

Fixed — kham-pg

PG 16+ lexize calling convention — arg3 in kham_dict_lexize is a List* on PG 16+, not a bool. Previously caused every token to be silently discarded as a stopword, producing empty tsvectors for all Thai text.

v0.4.0 2026-04-25 Release ↗

Added

Thai phonetic encoding — soundex module: lk82, udom83, metasound, thai_english_soundex. Unified soundex(word, SoundexAlgorithm) dispatch.
CLI --soundex flag — phonetic code appears in syn= FTS output field.
Accuracy benchmark — kham-bench-accuracy binary: word-boundary P/R/F1 against testdata, --threshold CI gate.
PyThaiNLP comparison script — scripts/compare_pythainlp.py with --export-testdata and --agreed modes.
Data expansions — NE gazetteer +17,240 Wikipedia entries and +8,980 Thai family names → 36,600 total. POS table +8,691 ORCHID entries → ~9,000 total. TNC frequency +2,410 entries.
Build size — tnc_freq.txt, ne_th.tsv, pos_th.tsv zlib-compressed at compile time via build.rs.
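
The word-boundary P/R/F1 metric named in the accuracy benchmark entry above can be sketched as below. This is one common definition (token end offsets compared as sets) and is an assumption about how kham-bench-accuracy scores; boundaries and boundary_prf are hypothetical names.

```rust
use std::collections::HashSet;

/// Character offsets at which each token ends, for one segmentation of a text.
fn boundaries(tokens: &[&str]) -> HashSet<usize> {
    let mut pos = 0;
    tokens
        .iter()
        .map(|t| {
            pos += t.chars().count();
            pos
        })
        .collect()
}

/// Word-boundary precision, recall and F1 of `predicted` against `gold`.
/// Both segmentations are assumed to cover the same underlying text.
fn boundary_prf(predicted: &[&str], gold: &[&str]) -> (f64, f64, f64) {
    let p = boundaries(predicted);
    let g = boundaries(gold);
    let tp = p.intersection(&g).count() as f64;
    let precision = if p.is_empty() { 0.0 } else { tp / p.len() as f64 };
    let recall = if g.is_empty() { 0.0 } else { tp / g.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

Over-segmentation shows up here as full recall with reduced precision, because every gold boundary is found but extra ones are predicted.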

Changed — breaking

Compound-first DP scoring — DpScore field order changed: minimising token count is now priority 2 (above dict-word maximisation). Fixes systematic over-segmentation. Micro F1 improved from 0.418 to 0.975; sentence agreement with PyThaiNLP newmm improved from 1/39 to 37/39 (94.9%).
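
Why field order changes segmentation results can be shown with a small sketch. Rust's derived Ord compares struct fields in declaration order, so moving a field changes which criterion wins. SketchScore is hypothetical: the changelog does not say what priority 1 is, so unknown_tokens is an assumption made only to fill that slot.

```rust
use std::cmp::Reverse;

/// Hypothetical score for one candidate segmentation; smaller is better.
/// The derived `Ord` compares fields top to bottom.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SketchScore {
    unknown_tokens: usize,      // priority 1 (assumed): fewer unknown tokens
    token_count: usize,         // priority 2: fewer tokens overall
    dict_words: Reverse<usize>, // priority 3: more dictionary words (reversed so min() prefers more)
}

/// Pick the best candidate under the lexicographic ordering.
fn best(candidates: &[SketchScore]) -> Option<SketchScore> {
    candidates.iter().copied().min()
}
```

Under this ordering a two-token reading that keeps a compound whole beats a three-token reading of the same text, even though the split reading contains more dictionary words.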

v0.3.0 2026-04-25 Release ↗

Added

abbrev module — AbbrevMap with 118-entry built-in TSV (months, era markers, ranks, agencies, Bangkok districts). Greedy longest-first pre-tokenisation expansion.
date module — Thai date normalization: 7 input formats, Buddhist Era + Gregorian, ISO 8601 output.
sentence module — Thai sentence segmentation: Thai terminators, Paiyannoi, universal punctuation, decimal/abbreviation-aware dot rules.
FTS pipeline — FtsTokenizerBuilder::abbrevs() opt-in abbreviation expansion.
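
The Buddhist Era handling in the date module entry above rests on a fixed offset: for modern dates the Gregorian year is the BE year minus 543. The sketch below covers a single assumed input shape, d/m/yyyy with a BE year, and be_date_to_iso is a hypothetical name; the module's actual seven formats are not reproduced here.

```rust
/// Convert a `d/m/yyyy` date whose year is in the Buddhist Era into ISO 8601.
/// Returns `None` for malformed input. Day-of-month is only range-checked
/// against 1..=31, not against the actual month length.
fn be_date_to_iso(input: &str) -> Option<String> {
    let mut parts = input.trim().split('/');
    let day: u32 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let be_year: i32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    // Buddhist Era is 543 years ahead of the Gregorian calendar.
    let year = be_year - 543;
    Some(format!("{year:04}-{month:02}-{day:02}"))
}
```

For example, 25/4/2569 in the Buddhist Era corresponds to 2026-04-25.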

v0.2.0 2026-04-24 Release ↗

Added

Named Entity Recognition — NeTagger: gazetteer-based, greedy longest-match, up to 5 consecutive tokens. Built-in NE gazetteer: 10,488 entries. TokenKind::Named(NamedEntityKind) — Person / Place / Org.
POS Tagging — PosTagger: lookup-based, pos_th.tsv with 338 entries, 13 ORCHID-derived categories (NOUN VERB ADJ ADV PART PROPN PRON NUM CLAS CONJ AUX DET PREP).
RTGS Romanization — RomanizationMap: 415-entry table-driven Thai → Roman mapping. Opt-in via FTS pipeline builder.
Number normalization — thai_digits_to_ascii, parse_thai_word, u64_to_thai_word, parse_thai_baht.
SQLite FTS5 extension — kham-sqlite loadable extension: full NLP pipeline, byte-accurate offsets for highlight() and snippet().
FTS pipeline — FtsTokenizer builder wiring POS, NE, romanization, stopwords, synonym expansion in a single pass.
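
The greedy longest-match strategy described for NeTagger above can be sketched as follows. This is an illustration under assumptions, not the shipped tagger: EntityKind, tag_entities, and the HashMap-based gazetteer are hypothetical, and adjacent tokens are joined by plain concatenation.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntityKind {
    Person,
    Place,
    Org,
}

/// Greedy longest-match tagging over a token sequence.
/// Returns `(start, end, kind)` spans over token indexes, `end` exclusive.
fn tag_entities(
    tokens: &[&str],
    gazetteer: &HashMap<String, EntityKind>,
    max_span: usize,
) -> Vec<(usize, usize, EntityKind)> {
    let mut spans = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let longest = max_span.min(tokens.len() - i);
        let mut matched = false;
        // Try the longest window first, then shrink one token at a time.
        for len in (1..=longest).rev() {
            let candidate: String = tokens[i..i + len].concat();
            if let Some(&kind) = gazetteer.get(&candidate) {
                spans.push((i, i + len, kind));
                i += len; // consume the whole match
                matched = true;
                break;
            }
        }
        if !matched {
            i += 1;
        }
    }
    spans
}
```

Trying the longest window first is what makes ประเทศ + ไทย resolve to the single entity ประเทศไทย rather than tagging ไทย on its own.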

v0.1.0 – v0.1.3 2026-04-18 – 2026-04-19

v0.1.3 — PostgreSQL pg_regress test suite: 67 tests across 4 suites (kham_fts, kham_thai, kham_operators, kham_ranking).

v0.1.2 — PostgreSQL FTS extension (kham-pg): Thai text search parser for PG 17, 6 token types, make install + Docker regress. FTS modules: stopwords, synonyms, ngrams, FtsTokenizer. C FFI kham_fts_lexemes().

v0.1.1 — Dual MIT OR Apache-2.0 licensing. scripts/deploy.sh. pyproject.toml for maturin builds.

v0.1.0 — Initial release: DAG-based newmm segmentation, DARTS dictionary, TCC boundary detection, Unicode pre-tokenizer, zero-copy Token, Python (PyO3) and WASM (wasm-bindgen) bindings, C FFI (cbindgen), CLI, criterion benchmarks.
