API Reference

v0.8.2

Public API of kham-core — available for Rust, Python (PyO3), WASM (JavaScript / TypeScript), and C FFI.

Module / Type       Description
Tokenizer           Core Thai word segmenter: maximal matching over a built-in DAWG dictionary.
Token / TokenKind   Zero-copy token slice with byte spans, char spans, and script category.
normalizer          Thai text normalization: deduplicate tone marks, compose sara am (อำ).
FtsTokenizer        Full-text search pipeline: stopwords, synonyms, POS, NE, soundex in one call.
PosTagger           13-category POS tagger derived from the ORCHID tagset.
NeTagger            Named entity recognition: Person, Place, Org via built-in gazetteer.
RomanizationMap     RTGS romanization lookup: Thai word → Latin transliteration.
number              Thai numeral utilities: parse Thai digit strings, word-to-number, baht text.
sentence            Sentence boundary detection: split a paragraph into sentence spans.
soundex             Phonetic codes: lk82, udom83, MetaSound, and cross-language Thai–English.
SpellChecker        Spell correction: Levenshtein ≤ 2 candidates ranked by lk82 soundex + TNC frequency.
KeyExtractor        Keyword extraction: TF × inverse-corpus-frequency, stopwords excluded.

Tokenizer

docs.rs ↗

High-level tokenizer backed by a compressed DAWG dictionary and a TNC (Thai National Corpus) frequency table. Uses the newmm (maximal matching) algorithm constrained to TCC (Thai Character Cluster) boundaries. In Rust, tokens are zero-copy slices of the input string.

use kham_core::Tokenizer;

let tok = Tokenizer::new();

// Simple — list of token strings
let words: Vec<&str> = tok.segment("กินข้าวกับปลา")
    .into_iter().map(|t| t.text).collect();
// ["กิน", "ข้าว", "กับ", "ปลา"]

// Rich — Token structs with span info
let tokens = tok.segment("ธนาคาร100แห่ง");
for t in &tokens {
    println!("{:8} chars={}..{} kind={:?}", t.text,
        t.char_span.start, t.char_span.end, t.kind);
}

// Custom dictionary — merge extra words with built-in
let tok2 = Tokenizer::builder()
    .dict_words("ปัญญาประดิษฐ์\nแมชชีนเลิร์นนิง\n")
    .build();

Rust key methods: Tokenizer::new(), Tokenizer::builder(), .segment(&str) → Vec<Token>
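
As a rough illustration of dictionary-driven maximal matching, the sketch below does greedy longest-match segmentation over a small in-memory word set. It is not kham-core's implementation: the real tokenizer uses a DAWG, TNC frequencies, and TCC boundaries, while `segment_longest` and its `HashSet` dictionary are hypothetical stand-ins chosen to keep the example short.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation (simplified; hypothetical helper,
/// not part of kham-core). Returns zero-copy slices of `input`.
fn segment_longest<'a>(input: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = input
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(input.len()))
        .collect();

    let mut out = Vec::new();
    let mut i = 0; // index into `bounds`
    while i + 1 < bounds.len() {
        // Try the longest candidate first, shrinking one char at a time.
        let mut end = bounds.len() - 1;
        while end > i + 1 && !dict.contains(&input[bounds[i]..bounds[end]]) {
            end -= 1;
        }
        // If nothing matched, `end == i + 1` and a single char is emitted.
        out.push(&input[bounds[i]..bounds[end]]);
        i = end;
    }
    out
}
```

With a dictionary containing กิน, ข้าว, กับ, and ปลา, the input กินข้าวกับปลา splits into those four words; characters that match no dictionary entry come out one at a time.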

Token / TokenKind

docs.rs ↗

Every segment() call returns tokens carrying the original text plus two span types: byte offsets (Rust slicing) and Unicode scalar-value offsets (Python / JS indexing).

use kham_core::{TokenKind, NamedEntityKind, Tokenizer};

let tok = Tokenizer::new();
let input = "ธนาคาร100แห่ง";
let tokens = tok.segment(input);

for t in &tokens {
    // t.text       — &str (zero-copy slice of input)
    // t.span       — Range<usize> byte offsets
    // t.char_span  — Range<usize> Unicode scalar-value offsets
    // t.kind       — TokenKind
    // t.confidence — f32: 0.0 (Unknown) … 1.0 (high-confidence dict match)

    assert_eq!(&input[t.span.clone()], t.text);
}

// TokenKind variants:
// Thai | Latin | Number | Punctuation | Emoji | Whitespace | Unknown
// Named(NamedEntityKind::Person | Place | Org)  ← set by NeTagger
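
To make the difference between the two span types concrete, here is a minimal standard-library sketch that derives a Unicode scalar-value range from a byte range. `char_span_of` is a hypothetical helper for illustration only; kham-core already supplies both spans on every token.

```rust
use std::ops::Range;

/// Convert a byte range within `input` into a range of Unicode
/// scalar-value (char) offsets. Hypothetical helper, not kham-core API.
/// Panics if the byte range does not fall on char boundaries.
fn char_span_of(input: &str, bytes: Range<usize>) -> Range<usize> {
    // Chars before the slice give the starting char offset.
    let start = input[..bytes.start].chars().count();
    // Chars inside the slice give its length in scalar values.
    let len = input[bytes.start..bytes.end].chars().count();
    start..start + len
}
```

Each Thai character in ธนาคาร100แห่ง takes 3 bytes in UTF-8, so the substring 100 sits at bytes 18..21 but at char offsets 6..9.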

normalizer

docs.rs ↗

Two-rule normalization pass: (1) deduplicate consecutive tone marks — keep the last one; (2) compose nikhahit (อํ U+0E4D) + sara aa (อา U+0E32) into sara am (อำ U+0E33). Call before segmenting when input may come from user keyboards or OCR.

use kham_core::normalizer::normalize;

// Rule 1 — deduplicate tone marks (keep last)
assert_eq!(normalize("ข้้าว"), "ข้าว");   // doubled mai tho → single
assert_eq!(normalize("ก่้"),   "ก้");      // mai ek + mai tho → mai tho

// Rule 2 — sara am composition
// nikhahit (U+0E4D) + sara aa (U+0E32) → sara am (U+0E33)
let decomposed = "\u{0E01}\u{0E4D}\u{0E32}"; // กํา: vowel spelled as two codepoints
assert_eq!(normalize(decomposed), "กำ");      // กำ: vowel is the single codepoint U+0E33

// Already canonical — returned unchanged (no allocation)
assert_eq!(normalize("กินข้าว"), "กินข้าว");
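
The two rules can be sketched with the standard library alone. This is a simplified illustration, not the library's code: `normalize_sketch` is a hypothetical name, it always allocates, and it ignores cases the real normalizer may handle, such as a tone mark sitting between nikhahit and sara aa.

```rust
/// Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..=U+0E4B).
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Simplified sketch of the two rules (hypothetical; always allocates).
fn normalize_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match (out.chars().next_back(), c) {
            // Rule 1: a tone mark directly after another tone mark replaces it.
            (Some(prev), c) if is_tone_mark(prev) && is_tone_mark(c) => {
                out.pop();
                out.push(c);
            }
            // Rule 2: nikhahit followed by sara aa becomes sara am.
            (Some('\u{0E4D}'), '\u{0E32}') => {
                out.pop();
                out.push('\u{0E33}');
            }
            _ => out.push(c),
        }
    }
    out
}
```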

FtsTokenizer

docs.rs ↗

Full NLP pipeline in a single pass: normalize → segment → NE → stopwords → POS → synonyms → romanization. In Python and WASM this is the primary way to access POS and NE metadata.

use kham_core::fts::FtsTokenizer;
use kham_core::soundex::SoundexAlgorithm;
use kham_core::synonym::SynonymMap;

// Default pipeline
let fts = FtsTokenizer::new();
let tokens = fts.segment_for_fts("นายกรัฐมนตรีกินข้าว");
for t in &tokens {
    println!("{:8} pos={:?} ne={:?} stop={}", t.text, t.pos, t.ne, t.is_stop);
}

// index_tokens: preserve positions, filter stopwords for phrase search
let indexed = fts.index_tokens("กินข้าวกับปลา");

// lexemes: flat Vec<String> of text + synonyms + trigrams (for tsvector)
let lexemes = fts.lexemes("กินข้าวกับปลา");

// Custom pipeline
let fts2 = FtsTokenizer::builder()
    .synonyms(SynonymMap::from_tsv("รถ\tรถยนต์\tยานพาหนะ\n"))
    .soundex(SoundexAlgorithm::Lk82)
    .build();

FtsToken fields: text, position, kind, is_stop, roman, pos, ne, synonyms, trigrams
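
For readers unfamiliar with trigram lexemes, the sketch below builds character trigrams over Unicode scalar values. How kham-core actually forms its trigrams (for example, over characters or over character clusters) is not stated here, so treat `char_trigrams` as a hypothetical illustration of the general idea.

```rust
/// Character trigrams over Unicode scalar values (hypothetical helper).
/// Words shorter than three chars yield no trigrams.
fn char_trigrams(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```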

PosTagger

docs.rs ↗

Dictionary-lookup POS tagger using a 13-category tagset derived from the ORCHID corpus. In Python, WASM, and C, POS tags are accessed via segment_fts() / kham_fts_segment() — direct tagger construction is Rust-only.

Tag    Category      Examples
NOUN   Noun          คน บ้าน ปลา
VERB   Verb          กิน ทำ ไป
ADJ    Adjective     ดี ใหญ่ สวย
ADV    Adverb        มาก เร็ว เสมอ
PART   Particle      ครับ ค่ะ นะ
PROPN  Proper noun   กรุงเทพ ไทย
PRON   Pronoun       ฉัน เขา เรา
NUM    Numeral       หนึ่ง สิบ ร้อย
CLAS   Classifier    ตัว ใบ อัน
CONJ   Conjunction   และ หรือ แต่
AUX    Auxiliary     ได้ ต้อง กำลัง
DET    Determiner    นี้ นั้น ทุก
PREP   Preposition   ใน บน ตาม

use kham_core::pos::PosTagger;

let tagger = PosTagger::builtin();

// Tag a single word
if let Some(pos) = tagger.tag("กิน") {
    println!("{:?}", pos); // Verb
}

// Custom tagger from TSV: word<TAB>POS_TAG
let custom = PosTagger::from_tsv("GPT\tNOUN\nแชทบอท\tNOUN\n");
assert_eq!(custom.tag("แชทบอท"), Some(kham_core::pos::PosTag::Noun));
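
The `word<TAB>TAG` format used by from_tsv is simple enough to sketch. The helper below is hypothetical and only shows one plausible way to read such a file into a lookup table; the library's handling of malformed lines and unknown tags may differ.

```rust
use std::collections::HashMap;

/// Parse `word<TAB>TAG` lines into a lookup table (hypothetical helper).
/// Blank lines and lines without a tab are skipped.
fn parse_tsv_lexicon(tsv: &str) -> HashMap<String, String> {
    tsv.lines()
        .filter_map(|line| line.split_once('\t'))
        .map(|(word, tag)| (word.trim().to_string(), tag.trim().to_string()))
        .collect()
}
```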

NeTagger

docs.rs ↗

Gazetteer-based NER with three categories: Person, Place, Org. In Python, WASM, and C, NE tags are accessed via segment_fts() / kham_fts_segment().

use kham_core::ne::NeTagger;
use kham_core::{TokenKind, Tokenizer};

let ne = NeTagger::builtin();
println!("{:?}", ne.tag("กรุงเทพ")); // Some(Place)

// Post-process tokens from Tokenizer::segment
let tok = Tokenizer::new();
let src = "บริษัทไทยออยล์ก่อตั้งในกรุงเทพ";
let tokens = ne.tag_tokens(tok.segment(src), src);

for t in &tokens {
    if matches!(t.kind, TokenKind::Named(_)) {
        println!("{} → {:?}", t.text, t.kind);
    }
}

// Custom gazetteer from TSV: word<TAB>NE_TAG  (PERSON | PLACE | ORG)
let custom = NeTagger::from_tsv("แอนโทรปิก\tORG\n");

RomanizationMap

docs.rs ↗

RTGS (Royal Thai General System of Transcription) table-lookup romanization. romanize() returns an Option, while romanize_or_raw() falls back to the original text for out-of-vocabulary words.

use kham_core::romanizer::RomanizationMap;

let rom = RomanizationMap::builtin();

// Single word lookup
println!("{:?}", rom.romanize("กรุงเทพ"));   // Some("Krung Thep")
println!("{}", rom.romanize_or_raw("ปลา"));   // "pla"
println!("{}", rom.romanize_or_raw("zzz"));   // "zzz"  (OOV → passthrough)

// Batch lookup
let roman = rom.romanize_tokens(&["กรุงเทพ", "ประเทศ", "ไทย"]);
println!("{:?}", roman); // ["Krung Thep", "prathet", "Thai"]

// Whole sentence romanization
let sentence = rom.romanize_sentence("กินข้าวกับปลา 100 บาท");
println!("{sentence}"); // kinkhaokapla 100 bat
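
The passthrough behaviour for out-of-vocabulary words amounts to a table lookup with a default. A minimal sketch, assuming a plain `HashMap` as the table (the library's internal representation is not documented here, and `romanize_or_raw_sketch` is a hypothetical name):

```rust
use std::collections::HashMap;

/// Look `word` up in `table`; return the input unchanged when it is missing.
/// Hypothetical stand-in for a table-lookup romanizer.
fn romanize_or_raw_sketch<'a>(table: &HashMap<&'a str, &'a str>, word: &'a str) -> &'a str {
    table.get(word).copied().unwrap_or(word)
}
```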

number

docs.rs ↗

Thai numeral utilities: convert Thai digit characters (๐–๙) to ASCII, parse/generate Thai cardinal number words, and render Thai Baht currency text.

use kham_core::number::{
    thai_digits_to_ascii, parse_thai_word, u64_to_thai_word,
    parse_thai_baht, to_thai_baht_text,
};

// Thai digits → ASCII
assert_eq!(thai_digits_to_ascii("ราคา ๑๒๓ บาท"), "ราคา 123 บาท");

// Number word parsing
assert_eq!(parse_thai_word("หนึ่งร้อยยี่สิบสาม"), Some(123));
assert_eq!(parse_thai_word("สองล้าน"),             Some(2_000_000));
assert_eq!(parse_thai_word("กินข้าว"),             None); // not a number

// Number → Thai word
println!("{}", u64_to_thai_word(42));        // "สี่สิบสอง"
println!("{}", u64_to_thai_word(1_000_000)); // "หนึ่งล้าน"

// Baht text
println!("{}", to_thai_baht_text(1234, 50));
// "หนึ่งพันสองร้อยสามสิบสี่บาทห้าสิบสตางค์"
if let Some(amt) = parse_thai_baht("หนึ่งร้อยบาทถ้วน") {
    println!("{} baht {} satang", amt.baht, amt.satang); // 100 0
}
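
Digit conversion relies on the Thai digits ๐ to ๙ occupying the contiguous range U+0E50 to U+0E59. The sketch below shows that mapping with the standard library only; `thai_digits_to_ascii_sketch` is a hypothetical name, not the library function.

```rust
/// Map Thai digits U+0E50..=U+0E59 to ASCII 0-9, leaving other chars untouched.
/// Hypothetical illustration of the digit mapping.
fn thai_digits_to_ascii_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0E50}'..='\u{0E59}' => {
                // Offset from Thai zero gives the digit value 0..=9.
                char::from(b'0' + (c as u32 - 0x0E50) as u8)
            }
            other => other,
        })
        .collect()
}
```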

sentence

docs.rs ↗

Sentence boundary detection. Splits on newlines, Thai markers (ฯ ๚ ๛), and Western punctuation (! ? . followed by space). Each sentence span carries char offsets for Python/JS string slicing.

use kham_core::sentence::split_sentences;

let text = "คุณชอบอาหารไทยไหม? ผมชอบต้มยำกุ้ง!\nอาหารไทยรสเผ็ด";
let sents = split_sentences(text);

for (i, s) in sents.iter().enumerate() {
    println!("S{i}: {:?}  chars={}..{}", s.text, s.char_span.start, s.char_span.end);
}
// S0: "คุณชอบอาหารไทยไหม?"     chars=0..19
// S1: " ผมชอบต้มยำกุ้ง!"       chars=19..36
// S2: "\nอาหารไทยรสเผ็ด"       chars=36..50

soundex

docs.rs ↗

Thai phonetic encoding: lk82 (12 groups, 4-char), udom83 (14 groups, 4-char), MetaSound (3 chars/syllable). Plus a Thai–English cross-language algorithm (Suwanvisat & Prasitjutrakul 1998) for transliterated name search.

use kham_core::soundex::{
    soundex, sounds_like, SoundexAlgorithm,
    thai_english_soundex, sounds_like_cross_lang,
};

// Thai soundex
println!("{}", soundex("กาน", SoundexAlgorithm::Lk82));      // "1600"
println!("{}", soundex("กาน", SoundexAlgorithm::Udom83));    // "1900"
println!("{}", soundex("กาน", SoundexAlgorithm::MetaSound)); // "112"

// Similarity check
assert!(sounds_like("กาน", "ขาน", SoundexAlgorithm::Lk82));  // same group
assert!(!sounds_like("ลาน", "ราน", SoundexAlgorithm::Udom83)); // ล/ร split

// Thai–English cross-language
println!("{}", thai_english_soundex("Somchai")); // same as thai_english_soundex("สมชาย")
assert!(sounds_like_cross_lang("สมชาย", "Somchai")); // true

SpellChecker

docs.rs ↗

Spelling correction over the built-in 62k-word dictionary. Candidates within Levenshtein edit distance ≤ 2 are returned, ranked by lk82 phonetic similarity, then edit distance, then TNC corpus frequency. Accepts single Thai words — segment first for multi-word input.

use kham_core::spell::SpellChecker;

// Reuse the checker — builtin() loads the TNC frequency map once
let checker = SpellChecker::builtin();

let suggs = checker.suggestions("กีนข้าว", 5);
for s in &suggs {
    println!("{:12} edit={} soundex={} freq={}",
        s.word, s.edit_distance, s.soundex_match, s.freq_score);
}
// กินข้าว  edit=1  soundex=true  freq=…

// Correctly spelled words appear with edit_distance = 0
let exact = checker.suggestions("กิน", 1);
assert_eq!(exact[0].word, "กิน");
assert_eq!(exact[0].edit_distance, 0);

// Suggestion fields:
// s.word          — String   candidate word from the dictionary
// s.edit_distance — u8       Levenshtein distance (0–2)
// s.soundex_match — bool     lk82 codes match
// s.freq_score    — u32      TNC corpus frequency (0 if not in table)

// Single best correction (reuses the checker built above)
if let Some(corrected) = checker.did_you_mean("กีนข้าว") {
    println!("Did you mean: {corrected}");  // กินข้าว
}
// Correct whole text
let text = "กีนข้าวกับปลา";
let out = checker.correct_text(text);
println!("{out}");

Note: suggestions() and did_you_mean() expect a single word. For text with multiple words, segment first with Tokenizer::segment() and check each Thai token individually.
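
The edit-distance filter can be illustrated with a standard two-row Levenshtein computation. This sketch counts edits in Unicode scalar values, which is an assumption; the unit kham-core uses internally is not stated here, and `levenshtein` is a hypothetical helper rather than a library export.

```rust
/// Levenshtein distance counted in Unicode scalar values (two-row DP).
/// Hypothetical helper for illustration.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution
                .min(prev[j + 1] + 1)      // deletion
                .min(curr[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

Under this definition, กีน and กิน differ by one substitution (sara ii versus sara i), so the misspelling falls inside the ≤ 2 window.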

KeyExtractor

docs.rs ↗

Unsupervised keyword extraction using TF × inverse-corpus-frequency scoring. Words rare in the TNC corpus score higher than common function words. Stopwords and single-character tokens are always excluded. Results are sorted by score descending.

use kham_core::keyword::KeyExtractor;

// Reuse the extractor — builtin() loads TNC freq + stopwords once
let extractor = KeyExtractor::builtin();

let text = "นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย";

let keywords = extractor.extract(text, 5);
for kw in &keywords {
    println!("{:12} score={:.4} count={}", kw.word, kw.score, kw.count);
}

// Keyword fields:
// kw.word  — String  the keyword text
// kw.score — f32     TF × (max_freq+1) / (corpus_freq+1)
// kw.count — usize   raw occurrence count in the document
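
The scoring formula can be worked through on pre-segmented tokens. The sketch below makes several assumptions that are labeled in the code: TF is taken as count divided by total tokens, the corpus table is a plain `HashMap`, and stopword filtering is left out. `score_keywords` is a hypothetical name, not the library's API.

```rust
use std::collections::HashMap;

/// Score tokens with TF × (max_freq + 1) / (corpus_freq + 1).
/// Assumptions: TF = count / total tokens; `corpus_freq` is a hypothetical
/// word → frequency table; words missing from it count as frequency 0.
/// Returns (word, score, count) sorted by score descending.
fn score_keywords(tokens: &[&str], corpus_freq: &HashMap<&str, u32>) -> Vec<(String, f32, usize)> {
    let max_freq = corpus_freq.values().copied().max().unwrap_or(0) as f32;
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let total = tokens.len().max(1) as f32;
    let mut scored: Vec<(String, f32, usize)> = counts
        .into_iter()
        .map(|(word, count)| {
            let freq = corpus_freq.get(word).copied().unwrap_or(0) as f32;
            let score = (count as f32 / total) * (max_freq + 1.0) / (freq + 1.0);
            (word.to_string(), score, count)
        })
        .collect();
    // Highest score first; ties broken by word for a deterministic order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored
}
```

For example, with corpus frequencies the=999, star=9, planet=99 and the tokens star, the, star, planet, the word star scores (2/4) × 1000 / 10 = 50, well ahead of planet (2.5) and the (0.25).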