API Reference
v0.8.2
Public API of kham-core, available for Rust, Python (PyO3), WASM (JavaScript / TypeScript), and C FFI.
| Module / Type | Description |
|---|---|
| Tokenizer | Core Thai word segmenter — maximal matching over a built-in DAWG dictionary. |
| Token / TokenKind | Zero-copy token slice with byte spans, char spans, and script category. |
| normalizer | Thai text normalization: deduplicate tone marks, compose sara am (อำ). |
| FtsTokenizer | Full-text search pipeline: stopwords, synonyms, POS, NE, soundex in one call. |
| PosTagger | 13-category POS tagger derived from the ORCHID tagset. |
| NeTagger | Named entity recognition: Person, Place, Org via built-in gazetteer. |
| RomanizationMap | RTGS romanization lookup — Thai word → Latin transliteration. |
| number | Thai numeral utilities: parse Thai digit strings, word-to-number, baht text. |
| sentence | Sentence boundary detection — split a paragraph into sentence spans. |
| soundex | Phonetic codes: lk82, udom83, MetaSound, and cross-language Thai–English. |
| SpellChecker | Spell correction: candidates within Levenshtein distance ≤ 2, ranked by lk82 soundex match, then edit distance, then TNC frequency. |
| KeyExtractor | Keyword extraction: TF × inverse-corpus-frequency, stopwords excluded. |
Tokenizer
High-level tokenizer backed by a compressed DAWG dictionary and a TNC (Thai National Corpus) frequency table. It uses the newmm (maximal matching) algorithm constrained to TCC (Thai Character Cluster) boundaries. In Rust, tokens are zero-copy slices of the input string.
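To make the idea of dictionary matching concrete, here is a minimal sketch of greedy longest-match segmentation. It is not the newmm algorithm and not kham-core's implementation: a `HashSet` stands in for the DAWG, there is no TCC handling or frequency scoring, and the name `longest_match` is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over a word set (toy stand-in for a DAWG).
/// Text that matches no dictionary word is emitted one character at a time.
fn longest_match<'a>(input: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = input
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(input.len()))
        .collect();
    let mut out = Vec::new();
    let mut i = 0; // index into `bounds`
    while i + 1 < bounds.len() {
        // Try the longest candidate first and shrink until a dictionary hit
        // or until only a single character is left.
        let mut end = bounds.len() - 1;
        while end > i + 1 && !dict.contains(&input[bounds[i]..bounds[end]]) {
            end -= 1;
        }
        out.push(&input[bounds[i]..bounds[end]]);
        i = end;
    }
    out
}

fn main() {
    let dict: HashSet<&str> = ["กิน", "ข้าว", "กับ", "ปลา"].iter().copied().collect();
    println!("{:?}", longest_match("กินข้าวกับปลา", &dict));
}
```

Because every returned slice borrows from `input`, the sketch also shows why zero-copy tokens cannot outlive the string they were cut from.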
use kham_core::Tokenizer;
let tok = Tokenizer::new();
// Simple — list of token strings
let words: Vec<&str> = tok.segment("กินข้าวกับปลา")
    .into_iter().map(|t| t.text).collect();
// ["กิน", "ข้าว", "กับ", "ปลา"]
// Rich — Token structs with span info
let tokens = tok.segment("ธนาคาร100แห่ง");
for t in &tokens {
    println!("{:8} chars={}..{} kind={:?}", t.text,
        t.char_span.start, t.char_span.end, t.kind);
}
// Custom dictionary — merge extra words with built-in
let tok2 = Tokenizer::builder()
    .dict_words("ปัญญาประดิษฐ์\nแมชชีนเลิร์นนิง\n")
    .build();
Methods: Tokenizer::new() · Tokenizer::builder() · .segment(&str) → Vec<Token>
Token / TokenKind
Every segment() call
returns tokens carrying the original text plus two span types: byte offsets (Rust slicing)
and Unicode scalar-value offsets (Python / JS indexing).
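The difference between the two span types can be illustrated with the standard library alone. The helper below is a hypothetical sketch (not part of kham-core) that converts a byte range into a range of Unicode scalar-value offsets, assuming both ends of the byte range fall on char boundaries.

```rust
use std::ops::Range;

/// Convert a byte range of `s` into a range of Unicode scalar-value offsets.
/// Assumes both ends of `bytes` lie on char boundaries (slicing panics otherwise).
fn byte_span_to_char_span(s: &str, bytes: Range<usize>) -> Range<usize> {
    let start = s[..bytes.start].chars().count();
    let len = s[bytes.start..bytes.end].chars().count();
    start..start + len
}

fn main() {
    let input = "ธนาคาร100แห่ง";
    // Each Thai character here takes 3 bytes in UTF-8; ASCII digits take 1 byte.
    let bytes = 18..21; // the substring "100"
    assert_eq!(&input[bytes.clone()], "100");
    println!("{:?}", byte_span_to_char_span(input, bytes)); // 6..9
}
```

Byte offsets are what Rust string slicing expects, while the scalar-value offsets line up with indexing in languages that index strings by code point.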
use kham_core::{TokenKind, NamedEntityKind, Tokenizer};
let tok = Tokenizer::new();
let input = "ธนาคาร100แห่ง";
let tokens = tok.segment(input);
for t in &tokens {
    // t.text — &str (zero-copy slice of input)
    // t.span — Range<usize> byte offsets
    // t.char_span — Range<usize> Unicode scalar-value offsets
    // t.kind — TokenKind
    // t.confidence — f32: 0.0 (Unknown) … 1.0 (high-confidence dict match)
    assert_eq!(&input[t.span.clone()], t.text);
}
// TokenKind variants:
// Thai | Latin | Number | Punctuation | Emoji | Whitespace | Unknown
// Named(NamedEntityKind::Person | Place | Org) ← set by NeTagger
normalizer
Two-rule normalization pass: (1) deduplicate consecutive tone marks, keeping the last one; (2) compose nikhahit (อํ U+0E4D) + sara aa (อา U+0E32) into sara am (อำ U+0E33). Call it before segmenting when input may come from user keyboards or OCR.
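The two rules can be sketched in a few lines of standard-library Rust. This is an illustration under stated assumptions, not the library's code: `normalize_sketch` and `is_tone_mark` are hypothetical names, tone marks are taken to be U+0E48 through U+0E4B, and only directly adjacent code points are considered.

```rust
/// Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..=U+0E4B).
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Apply both rules in one left-to-right pass over the code points.
fn normalize_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        let last = out.chars().next_back();
        match last {
            // Rule 1: a tone mark directly after another replaces it (keep the last).
            Some(prev) if is_tone_mark(prev) && is_tone_mark(c) => {
                out.pop();
                out.push(c);
            }
            // Rule 2: nikhahit followed by sara aa composes into sara am.
            Some('\u{0E4D}') if c == '\u{0E32}' => {
                out.pop();
                out.push('\u{0E33}');
            }
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    // Doubled mai tho collapses to a single one.
    println!("{}", normalize_sketch("ข\u{0E49}\u{0E49}าว"));
    // Decomposed sara am is composed.
    println!("{}", normalize_sketch("\u{0E01}\u{0E4D}\u{0E32}"));
}
```

Unlike the documented `normalize`, this sketch always allocates a new `String`, even when the input is already canonical.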
use kham_core::normalizer::normalize;
// Rule 1 — deduplicate tone marks (keep last)
assert_eq!(normalize("ข้้าว"), "ข้าว"); // doubled mai tho → single
assert_eq!(normalize("ก่้"), "ก้"); // mai ek + mai tho → mai tho
// Rule 2 — sara am composition
// nikhahit (U+0E4D) + sara aa (U+0E32) → sara am (U+0E33)
let decomposed = "\u{0E01}\u{0E4D}\u{0E32}"; // กํา: ko kai + nikhahit + sara aa (three codepoints)
assert_eq!(normalize(decomposed), "กำ"); // กำ: ko kai + sara am (two codepoints)
// Already canonical — returned unchanged (no allocation)
assert_eq!(normalize("กินข้าว"), "กินข้าว");
FtsTokenizer
Full NLP pipeline in a single pass: normalize → segment → NE → stopwords → POS → synonyms → romanization. In Python and WASM this is the primary way to access POS and NE metadata.
use kham_core::fts::FtsTokenizer;
use kham_core::soundex::SoundexAlgorithm;
use kham_core::synonym::SynonymMap;
// Default pipeline
let fts = FtsTokenizer::new();
let tokens = fts.segment_for_fts("นายกรัฐมนตรีกินข้าว");
for t in &tokens {
    println!("{:8} pos={:?} ne={:?} stop={}", t.text, t.pos, t.ne, t.is_stop);
}
// index_tokens: preserve positions, filter stopwords for phrase search
let indexed = fts.index_tokens("กินข้าวกับปลา");
// lexemes: flat Vec<String> of text + synonyms + trigrams (for tsvector)
let lexemes = fts.lexemes("กินข้าวกับปลา");
// Custom pipeline
let fts2 = FtsTokenizer::builder()
    .synonyms(SynonymMap::from_tsv("รถ\tรถยนต์\tยานพาหนะ\n"))
    .soundex(SoundexAlgorithm::Lk82)
    .build();
Fields: text · position · kind · is_stop · roman · pos · ne · synonyms · trigrams
PosTagger
Dictionary-lookup POS tagger using a 13-category tagset derived from the ORCHID corpus.
In Python, WASM, and C, POS tags are accessed via segment_fts() / kham_fts_segment() — direct tagger construction is Rust-only.
| Tag | Category | Examples |
|---|---|---|
| NOUN | Noun | คน บ้าน ปลา |
| VERB | Verb | กิน ทำ ไป |
| ADJ | Adjective | ดี ใหญ่ สวย |
| ADV | Adverb | มาก เร็ว เสมอ |
| PART | Particle | ครับ ค่ะ นะ |
| PROPN | Proper noun | กรุงเทพ ไทย |
| PRON | Pronoun | ฉัน เขา เรา |
| NUM | Numeral | หนึ่ง สิบ ร้อย |
| CLAS | Classifier | ตัว ใบ อัน |
| CONJ | Conjunction | และ หรือ แต่ |
| AUX | Auxiliary | ได้ ต้อง กำลัง |
| DET | Determiner | นี้ นั้น ทุก |
| PREP | Preposition | ใน บน ตาม |
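A dictionary-lookup tagger of this kind amounts to a map from word to tag. The sketch below parses `word<TAB>POS_TAG` lines into a `HashMap`; the `Tag` enum (a three-variant subset), `parse_tag`, and `load_tsv` are hypothetical names for illustration and do not mirror kham-core's internals.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the 13-category tagset, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tag {
    Noun,
    Verb,
    Adj,
}

fn parse_tag(s: &str) -> Option<Tag> {
    match s {
        "NOUN" => Some(Tag::Noun),
        "VERB" => Some(Tag::Verb),
        "ADJ" => Some(Tag::Adj),
        _ => None,
    }
}

/// Parse `word<TAB>TAG` lines; lines without a tab or with an unknown tag are skipped.
fn load_tsv(tsv: &str) -> HashMap<String, Tag> {
    tsv.lines()
        .filter_map(|line| {
            let (word, tag) = line.split_once('\t')?;
            Some((word.trim().to_string(), parse_tag(tag.trim())?))
        })
        .collect()
}

fn main() {
    let table = load_tsv("GPT\tNOUN\nแชทบอท\tNOUN\nกิน\tVERB\n");
    println!("{:?}", table.get("กิน")); // Some(Verb)
}
```

Lookup is then a single hash probe per token, which is why a word absent from the table simply yields no tag.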
use kham_core::pos::PosTagger;
let tagger = PosTagger::builtin();
// Tag a single word
if let Some(pos) = tagger.tag("กิน") {
    println!("{:?}", pos); // Verb
}
// Custom tagger from TSV: word<TAB>POS_TAG
let custom = PosTagger::from_tsv("GPT\tNOUN\nแชทบอท\tNOUN\n");
assert_eq!(custom.tag("แชทบอท"), Some(kham_core::pos::PosTag::Noun));
NeTagger
Gazetteer-based NER with three categories: Person, Place, Org.
In Python, WASM, and C, NE tags are accessed via segment_fts() / kham_fts_segment().
use kham_core::ne::NeTagger;
use kham_core::{TokenKind, Tokenizer};
let ne = NeTagger::builtin();
println!("{:?}", ne.tag("กรุงเทพ")); // Some(Place)
// Post-process tokens from Tokenizer::segment
let tok = Tokenizer::new();
let src = "บริษัทไทยออยล์ก่อตั้งในกรุงเทพ";
let tokens = ne.tag_tokens(tok.segment(src), src);
for t in &tokens {
    if matches!(t.kind, TokenKind::Named(_)) {
        println!("{} → {:?}", t.text, t.kind);
    }
}
// Custom gazetteer from TSV: word<TAB>NE_TAG (PERSON | PLACE | ORG)
let custom = NeTagger::from_tsv("แอนโทรปิก\tORG\n");
RomanizationMap
RTGS (Royal Thai General System) table-lookup romanization. Falls back to the original Thai text for out-of-vocabulary words.
use kham_core::romanizer::RomanizationMap;
let rom = RomanizationMap::builtin();
// Single word lookup
println!("{:?}", rom.romanize("กรุงเทพ")); // Some("Krung Thep")
println!("{}", rom.romanize_or_raw("ปลา")); // "pla"
println!("{}", rom.romanize_or_raw("zzz")); // "zzz" (OOV → passthrough)
// Batch lookup
let roman = rom.romanize_tokens(&["กรุงเทพ", "ประเทศ", "ไทย"]);
println!("{:?}", roman); // ["Krung Thep", "prathet", "Thai"]
// Whole sentence romanization
let sentence = rom.romanize_sentence("กินข้าวกับปลา 100 บาท");
println!("{sentence}"); // kinkhaokapla 100 bat
number
Thai numeral utilities: convert Thai digit characters (๐–๙) to ASCII, parse/generate Thai cardinal number words, and render Thai Baht currency text.
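The digit conversion is a fixed offset within the Thai Unicode block: ๐ is U+0E50 and ๙ is U+0E59. A minimal sketch, using the hypothetical name `thai_digits_to_ascii_sketch` to keep it distinct from the library function:

```rust
/// Replace Thai digits (U+0E50..=U+0E59) with ASCII '0'..='9'; leave everything else as is.
fn thai_digits_to_ascii_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0E50}'..='\u{0E59}' => {
                // The offset from Thai digit zero is the digit value 0..=9.
                std::char::from_digit(c as u32 - 0x0E50, 10).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}

fn main() {
    println!("{}", thai_digits_to_ascii_sketch("ราคา ๑๒๓ บาท")); // ราคา 123 บาท
}
```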
use kham_core::number::{
    thai_digits_to_ascii, parse_thai_word, u64_to_thai_word,
    parse_thai_baht, to_thai_baht_text,
};
// Thai digits → ASCII
assert_eq!(thai_digits_to_ascii("ราคา ๑๒๓ บาท"), "ราคา 123 บาท");
// Number word parsing
assert_eq!(parse_thai_word("หนึ่งร้อยยี่สิบสาม"), Some(123));
assert_eq!(parse_thai_word("สองล้าน"), Some(2_000_000));
assert_eq!(parse_thai_word("กินข้าว"), None); // not a number
// Number → Thai word
println!("{}", u64_to_thai_word(42)); // "สี่สิบสอง"
println!("{}", u64_to_thai_word(1_000_000)); // "หนึ่งล้าน"
// Baht text
println!("{}", to_thai_baht_text(1234, 50));
// "หนึ่งพันสองร้อยสามสิบสี่บาทห้าสิบสตางค์"
if let Some(amt) = parse_thai_baht("หนึ่งร้อยบาทถ้วน") {
    println!("{} baht {} satang", amt.baht, amt.satang); // 100 0
}
sentence
Sentence boundary detection. Splits on newlines, Thai markers (ฯ ๚ ๛),
and Western punctuation (! ? . followed by space).
Each sentence span carries char offsets for Python/JS string slicing.
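The boundary rules above can be approximated with a short scan over the characters. The sketch below is an assumption-laden illustration, not the library's algorithm: `split_sketch` is a hypothetical name, a sentence is taken to end right after a Thai marker, right after `!`, `?` or `.` when a space follows, and right before a newline, and the returned ranges are char offsets.

```rust
use std::ops::Range;

/// Return contiguous char-offset spans, one per sentence.
fn split_sketch(text: &str) -> Vec<Range<usize>> {
    let chars: Vec<char> = text.chars().collect();
    let mut spans = Vec::new();
    let mut start = 0;
    for (i, &c) in chars.iter().enumerate() {
        let next = chars.get(i + 1).copied();
        // Thai markers: paiyannoi, angkhankhu, khomut.
        let is_marker = matches!(c, '\u{0E2F}' | '\u{0E5A}' | '\u{0E5B}');
        let is_punct = matches!(c, '!' | '?' | '.');
        let ends_here =
            is_marker || (is_punct && next == Some(' ')) || next == Some('\n');
        if ends_here {
            spans.push(start..i + 1);
            start = i + 1;
        }
    }
    // Whatever remains after the last boundary is the final sentence.
    if start < chars.len() {
        spans.push(start..chars.len());
    }
    spans
}

fn main() {
    println!("{:?}", split_sketch("Hi? Yes!\nok")); // [0..3, 3..8, 8..11]
}
```

With these rules the separating space or newline stays at the start of the following span, so the spans tile the input without gaps.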
use kham_core::sentence::split_sentences;
let text = "คุณชอบอาหารไทยไหม? ผมชอบต้มยำกุ้ง!\nอาหารไทยรสเผ็ด";
let sents = split_sentences(text);
for (i, s) in sents.iter().enumerate() {
    println!("S{i}: {:?} chars={}..{}", s.text, s.char_span.start, s.char_span.end);
}
// S0: "คุณชอบอาหารไทยไหม?" chars=0..19
// S1: " ผมชอบต้มยำกุ้ง!" chars=19..36
// S2: "\nอาหารไทยรสเผ็ด" chars=36..50
soundex
Thai phonetic encoding: lk82 (12 groups, 4-char), udom83 (14 groups, 4-char), MetaSound (3 chars/syllable). Plus a Thai–English cross-language algorithm (Suwanvisat & Prasitjutrakul 1998) for transliterated name search.
use kham_core::soundex::{
    soundex, sounds_like, SoundexAlgorithm,
    thai_english_soundex, sounds_like_cross_lang,
};
// Thai soundex
println!("{}", soundex("กาน", SoundexAlgorithm::Lk82)); // "1600"
println!("{}", soundex("กาน", SoundexAlgorithm::Udom83)); // "1900"
println!("{}", soundex("กาน", SoundexAlgorithm::MetaSound)); // "112"
// Similarity check
assert!(sounds_like("กาน", "ขาน", SoundexAlgorithm::Lk82)); // same group
assert!(!sounds_like("ลาน", "ราน", SoundexAlgorithm::Udom83)); // ล/ร split
// Thai–English cross-language
println!("{}", thai_english_soundex("Somchai")); // same as thai_english_soundex("สมชาย")
assert!(sounds_like_cross_lang("สมชาย", "Somchai")); // true
SpellChecker
Spelling correction over the built-in 62k-word dictionary. Candidates within Levenshtein edit distance ≤ 2 are returned, ranked by lk82 phonetic similarity, then edit distance, then TNC corpus frequency. Accepts single Thai words; segment first for multi-word input.
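The candidate filter can be sketched with a textbook Levenshtein distance computed over Unicode scalar values. This is a simplified illustration: `levenshtein` and `candidates` are hypothetical names, the word list is a toy slice rather than the built-in dictionary, and ranking here uses edit distance only, without the soundex and frequency keys described above.

```rust
/// Levenshtein distance over chars, keeping only the previous row of the table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // Substitution, deletion, insertion.
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Keep dictionary words within distance 2, closest first (stable for ties).
fn candidates<'a>(word: &str, dict: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut hits: Vec<(&'a str, usize)> = dict
        .iter()
        .map(|&w| (w, levenshtein(word, w)))
        .filter(|&(_, d)| d <= 2)
        .collect();
    hits.sort_by_key(|&(_, d)| d);
    hits
}

fn main() {
    let dict = ["กิน", "ปลา", "กัน"];
    println!("{:?}", candidates("กีน", &dict));
}
```

Counting chars rather than bytes matters for Thai, where one visible vowel or tone mark is a separate scalar value encoded in three bytes.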
use kham_core::spell::SpellChecker;
// Reuse the checker — builtin() loads the TNC frequency map once
let checker = SpellChecker::builtin();
let suggs = checker.suggestions("กีนข้าว", 5);
for s in &suggs {
    println!("{:12} edit={} soundex={} freq={}",
        s.word, s.edit_distance, s.soundex_match, s.freq_score);
}
// กินข้าว edit=1 soundex=true freq=…
// Correctly spelled words appear with edit_distance = 0
let exact = checker.suggestions("กิน", 1);
assert_eq!(exact[0].word, "กิน");
assert_eq!(exact[0].edit_distance, 0);
// Suggestion fields:
// s.word — String candidate word from the dictionary
// s.edit_distance — u8 Levenshtein distance (0–2)
// s.soundex_match — bool lk82 codes match
// s.freq_score — u32 TNC corpus frequency (0 if not in table)
// Single best correction (reuses the checker built above)
if let Some(corrected) = checker.did_you_mean("กีนข้าว") {
    println!("Did you mean: {corrected}"); // กินข้าว
}
// Correct whole text
let text = "กีนข้าวกับปลา";
let out = checker.correct_text(text);
println!("{out}");
Note: for multi-word input, run Tokenizer::segment() and check each Thai token individually.
KeyExtractor
Unsupervised keyword extraction using TF × inverse-corpus-frequency scoring. Words rare in the TNC corpus score higher than common function words. Stopwords and single-character tokens are always excluded. Results are sorted by score descending.
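As a rough sketch of the scoring, the function below applies TF × (max_freq + 1) / (corpus_freq + 1) to an already segmented token list. Several things are assumptions made for illustration: TF is taken to be the raw occurrence count, the frequency table and stopword list are toy inputs, and `score_keywords` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Score tokens by count × (max_freq + 1) / (corpus_freq + 1), highest first.
/// Returns (word, score, count) tuples.
fn score_keywords(
    tokens: &[&str],
    corpus_freq: &HashMap<&str, u32>,
    stopwords: &[&str],
    top_n: usize,
) -> Vec<(String, f32, usize)> {
    let max_freq = corpus_freq.values().copied().max().unwrap_or(0) as f32;
    // Count occurrences, skipping stopwords and single-character tokens.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &t in tokens {
        if t.chars().count() > 1 && !stopwords.contains(&t) {
            *counts.entry(t).or_insert(0) += 1;
        }
    }
    let mut scored: Vec<(String, f32, usize)> = counts
        .into_iter()
        .map(|(w, n)| {
            // Words missing from the table are treated as frequency 0.
            let cf = corpus_freq.get(w).copied().unwrap_or(0) as f32;
            (w.to_string(), n as f32 * (max_freq + 1.0) / (cf + 1.0), n)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(top_n);
    scored
}

fn main() {
    let mut cf: HashMap<&str, u32> = HashMap::new();
    cf.insert("ดาว", 9);
    cf.insert("ระบบ", 99);
    cf.insert("ใน", 999);
    let tokens = ["ดาว", "ใน", "ดาว", "ระบบ"];
    for (word, score, count) in score_keywords(&tokens, &cf, &["ใน"], 5) {
        println!("{word} score={score:.1} count={count}");
    }
}
```

The (max_freq + 1) numerator only rescales the scores; the ordering comes from the count divided by the smoothed corpus frequency.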
use kham_core::keyword::KeyExtractor;
// Reuse the extractor — builtin() loads TNC freq + stopwords once
let extractor = KeyExtractor::builtin();
let text = "นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย";
let keywords = extractor.extract(text, 5);
for kw in &keywords {
    println!("{:12} score={:.4} count={}", kw.word, kw.score, kw.count);
}
// Keyword fields:
// kw.word — String the keyword text
// kw.score — f32 TF × (max_freq+1) / (corpus_freq+1)
// kw.count — usize raw occurrence count in the document