Changelog

All notable changes to kham. Follows Keep a Changelog and Semantic Versioning.

v0.5.0 2026-04-26

Added — kham-pg

  • kham_fts_dict — custom dictionary template expanding each Thai/Named token to up to 6 lexemes: normalised word, lk82 Thai Soundex code, and RTGS romanization. Enables phonetic-fuzzy and Latin-script romanization search without schema changes.
  • ts_headline support — kham_headline callback registered as the HEADLINE function. Marks matching tokens with configurable StartSel/StopSel. 5 regress tests added.
  • Named entity token type — TokenKind::Named(_) → 7 (named) registered in kham_lextypes.
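
As an illustration of what StartSel/StopSel marking amounts to, here is a minimal sketch in plain Rust. It is not the kham_headline callback itself: the function name mark_headline is hypothetical, the input is assumed to be already segmented, and matching is done on raw token text rather than on lexemes as PostgreSQL does.

```rust
/// Wrap every token that matches a query term in start/stop markers.
/// `start_sel` and `stop_sel` play the role of StartSel/StopSel.
/// Hypothetical helper for illustration, not part of kham-pg.
fn mark_headline(tokens: &[&str], query: &[&str], start_sel: &str, stop_sel: &str) -> String {
    let mut out = String::new();
    for tok in tokens {
        if query.contains(tok) {
            out.push_str(start_sel);
            out.push_str(tok);
            out.push_str(stop_sel);
        } else {
            out.push_str(tok);
        }
    }
    out
}

fn main() {
    // Pre-segmented Thai tokens; segmentation itself is out of scope here.
    let tokens = ["ฉัน", "กิน", "ข้าว"];
    let marked = mark_headline(&tokens, &["ข้าว"], "<b>", "</b>");
    // Thai is written without spaces, so tokens are joined directly.
    println!("{marked}");
}
```

Because Thai text has no spaces between words, the markers can only land on correct boundaries if the headline function sees the same token offsets the parser produced.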

Fixed — kham-pg

  • PG 16+ lexize calling convention — arg3 in kham_dict_lexize is a List* on PG 16+, not a bool. Previously caused every token to be silently discarded as a stopword, producing empty tsvectors for all Thai text.

v0.4.0 2026-04-25

Added

  • Thai phonetic encoding — soundex module: lk82, udom83, metasound, thai_english_soundex. Unified soundex(word, SoundexAlgorithm) dispatch.
  • CLI --soundex flag — phonetic code appears in syn= FTS output field.
  • Accuracy benchmark — kham-bench-accuracy binary: word-boundary P/R/F1 against testdata, --threshold CI gate.
  • PyThaiNLP comparison script — scripts/compare_pythainlp.py with --export-testdata and --agreed modes.
  • Data expansions — NE gazetteer: +17,240 Wikipedia entries and +8,980 Thai family names → 36,600 total. POS table: +8,691 ORCHID entries → ~9,000 total. TNC frequency: +2,410 entries.
  • Build size — tnc_freq.txt, ne_th.tsv, pos_th.tsv zlib-compressed at compile time via build.rs.
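
To make the word-boundary P/R/F1 metric concrete, the sketch below scores one predicted segmentation against a gold one. It is an assumption about how such a metric is typically computed, not the kham-bench-accuracy code: the function names are hypothetical, boundaries are counted in Unicode scalar values, and the end-of-sentence boundary is included.

```rust
use std::collections::HashSet;

/// Character offsets at which a word ends, for one segmentation of a sentence.
fn boundaries(words: &[&str]) -> HashSet<usize> {
    let mut pos = 0usize;
    words
        .iter()
        .map(|w| {
            pos += w.chars().count();
            pos
        })
        .collect()
}

/// Precision, recall and F1 of predicted boundaries against gold boundaries.
fn prf1(gold: &[&str], pred: &[&str]) -> (f64, f64, f64) {
    let g = boundaries(gold);
    let p = boundaries(pred);
    // A true positive is a boundary present in both segmentations.
    let tp = g.intersection(&p).count() as f64;
    let precision = if p.is_empty() { 0.0 } else { tp / p.len() as f64 };
    let recall = if g.is_empty() { 0.0 } else { tp / g.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}

fn main() {
    // Classic ambiguity: "ตาก|ลม" (gold) versus "ตา|กลม" (prediction).
    let (p, r, f1) = prf1(&["ตาก", "ลม"], &["ตา", "กลม"]);
    println!("P={p:.3} R={r:.3} F1={f1:.3}");
}
```

A micro-averaged score would sum true positives and boundary counts over all sentences before dividing, rather than averaging per-sentence F1 values.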

Changed — breaking

  • Compound-first DP scoring — DpScore field order changed: minimising token count is now priority 2 (above dict-word maximisation). Fixes systematic over-segmentation. Micro F1 improved from 0.418 to 0.975; sentence agreement vs PyThaiNLP newmm rose from 1/39 to 37/39 (94.9%).
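
Why a change in priority order alters segmentation can be shown with lexicographic tuple comparison. The sketch below is illustrative only and does not reproduce DpScore: the Candidate struct and both key functions are hypothetical, the priority-1 criterion (unknown_chars) is a placeholder because the changelog does not name it, and the "previous" ordering is inferred from the wording above.

```rust
use std::cmp::Reverse;

/// One candidate segmentation of the same text span (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct Candidate {
    unknown_chars: u32, // placeholder for priority 1; an assumption, not from kham
    token_count: u32,   // fewer is better
    dict_words: u32,    // more is better
}

/// Compound-first ordering: token count outranks dictionary-word count.
/// Tuples compare lexicographically; `Reverse` turns "smaller" into "better".
fn compound_first_key(c: &Candidate) -> (Reverse<u32>, Reverse<u32>, u32) {
    (Reverse(c.unknown_chars), Reverse(c.token_count), c.dict_words)
}

/// Inferred previous ordering: dictionary-word count outranks token count.
fn dict_first_key(c: &Candidate) -> (Reverse<u32>, u32, Reverse<u32>) {
    (Reverse(c.unknown_chars), c.dict_words, Reverse(c.token_count))
}

fn main() {
    // The same span read as one compound word or as two shorter dictionary words.
    let compound = Candidate { unknown_chars: 0, token_count: 1, dict_words: 1 };
    let split = Candidate { unknown_chars: 0, token_count: 2, dict_words: 2 };

    // Compound-first keeps the compound intact.
    assert!(compound_first_key(&compound) > compound_first_key(&split));
    // Dict-first prefers the split, which is the over-segmentation pattern.
    assert!(dict_first_key(&split) > dict_first_key(&compound));
    println!("compound-first picks {:?}", compound);
}
```

Callers that relied on the earlier tie-breaking will see different token boundaries for compounds, which is why the entry is listed as breaking.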

v0.3.0 2026-04-25

Added

  • abbrev module — AbbrevMap with 118-entry built-in TSV (months, era markers, ranks, agencies, Bangkok districts). Greedy longest-first pre-tokenisation expansion.
  • date module — Thai date normalization: 7 input formats, Buddhist Era + Gregorian, ISO 8601 output.
  • sentence module — Thai sentence segmentation: Thai terminators, Paiyannoi, universal punctuation, decimal/abbreviation-aware dot rules.
  • FTS pipeline — FtsTokenizerBuilder::abbrevs() opt-in abbreviation expansion.
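
The core of Buddhist Era handling is a fixed year offset: for modern dates, BE year = Gregorian year + 543. The sketch below converts one assumed input shape, day/month/year with a BE year, to ISO 8601. The function name and the choice of input format are hypothetical; the changelog does not list which 7 formats the date module accepts, and the range checks here are deliberately coarse (no per-month day limits).

```rust
/// Convert "d/m/yyyy" with a Buddhist Era year to an ISO 8601 date string.
/// Hypothetical helper; returns None for malformed or out-of-range input.
fn be_slash_date_to_iso(input: &str) -> Option<String> {
    let mut parts = input.trim().split('/');
    let day: u32 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let be_year: u32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three fields
    }
    if !(1..=31).contains(&day) || !(1..=12).contains(&month) || be_year <= 543 {
        return None;
    }
    // Modern Thai calendar: BE = CE + 543.
    let ce_year = be_year - 543;
    Some(format!("{:04}-{:02}-{:02}", ce_year, month, day))
}

fn main() {
    // 26 April 2569 BE is 26 April 2026 CE.
    println!("{:?}", be_slash_date_to_iso("26/04/2569"));
}
```

A real normalizer also has to decide whether a four-digit year is BE or Gregorian, which this sketch sidesteps by assuming BE.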

v0.2.0 2026-04-24

Added

  • Named Entity Recognition — NeTagger: gazetteer-based, greedy longest-match, up to 5 consecutive tokens. Built-in NE gazetteer: 10,488 entries. TokenKind::Named(NamedEntityKind) — Person / Place / Org.
  • POS Tagging — PosTagger: lookup-based, pos_th.tsv with 338 entries, 13 ORCHID-derived categories (NOUN VERB ADJ ADV PART PROPN PRON NUM CLAS CONJ AUX DET PREP).
  • RTGS Romanization — RomanizationMap: 415-entry table-driven Thai → Roman mapping. Opt-in via FTS pipeline builder.
  • Number normalization — thai_digits_to_ascii, parse_thai_word, u64_to_thai_word, parse_thai_baht.
  • SQLite FTS5 extension — kham-sqlite loadable extension: full NLP pipeline, byte-accurate offsets for highlight() and snippet().
  • FTS pipeline — FtsTokenizer builder wiring POS, NE, romanization, stopwords, synonym expansion in a single pass.
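
Greedy longest-match tagging over a gazetteer can be sketched in a few lines. This is an illustration of the general technique, not NeTagger's implementation: the function name, the string labels, and the choice to look up the concatenation of adjacent tokens are all assumptions. Only the 5-token window comes from the entry above.

```rust
use std::collections::HashMap;

/// Window size taken from the changelog: up to 5 consecutive tokens.
const MAX_SPAN: usize = 5;

/// Merge runs of tokens found in the gazetteer, trying the longest run first.
/// Returns (text, Some(label)) for entities and (text, None) for other tokens.
fn tag_entities(
    tokens: &[&str],
    gazetteer: &HashMap<&str, &'static str>,
) -> Vec<(String, Option<&'static str>)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let max_len = MAX_SPAN.min(tokens.len() - i);
        let mut matched = false;
        // Try 5 tokens, then 4, ... down to 1, and stop at the first hit.
        for len in (1..=max_len).rev() {
            let candidate: String = tokens[i..i + len].concat();
            if let Some(label) = gazetteer.get(candidate.as_str()) {
                out.push((candidate, Some(*label)));
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            out.push((tokens[i].to_string(), None));
            i += 1;
        }
    }
    out
}

fn main() {
    let mut gazetteer = HashMap::new();
    gazetteer.insert("กรุงเทพ", "Place");
    gazetteer.insert("กรุงเทพมหานคร", "Place");
    // The two-token entry wins over the one-token entry because it is tried first.
    let tagged = tag_entities(&["กรุงเทพ", "มหานคร", "สวย"], &gazetteer);
    println!("{tagged:?}");
}
```

Trying longer spans first is what keeps a multi-token name from being tagged as its shorter prefix.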

v0.1.0 – v0.1.3 2026-04-18 – 2026-04-19

v0.1.3 — PostgreSQL pg_regress test suite: 67 tests across 4 suites (kham_fts, kham_thai, kham_operators, kham_ranking).
v0.1.2 — PostgreSQL FTS extension (kham-pg): Thai text search parser for PG 17, 6 token types, make install + Docker regress. FTS modules: stopwords, synonyms, ngrams, FtsTokenizer. C FFI kham_fts_lexemes().
v0.1.1 — Dual MIT OR Apache-2.0 licensing. scripts/deploy.sh. pyproject.toml for maturin builds.
v0.1.0 — Initial release: DAG-based newmm segmentation, DARTS dictionary, TCC boundary detection, Unicode pre-tokenizer, zero-copy Token, Python (PyO3) and WASM (wasm-bindgen) bindings, C FFI (cbindgen), CLI, criterion benchmarks.