API Reference
v0.5.0
Public API of kham-core — the pure Rust, no_std core library.
Full rustdoc is available on docs.rs/kham-core.
| Module / Type | Description |
|---|---|
| Tokenizer | Core Thai word segmenter — maximal matching over a built-in DAWG dictionary. |
| Token / TokenKind | Zero-copy token slice with byte spans, char spans, and script category. |
| FtsTokenizer | Full-text search pipeline: stopwords, synonyms, POS, NE, soundex in one call. |
| PosTagger | 13-category POS tagger derived from the ORCHID tagset. |
| NeTagger | Named entity recognition: Person, Place, Org via built-in gazetteer. |
| RomanizationMap | RTGS romanization lookup — Thai word → Latin transliteration. |
| number | Thai numeral utilities: parse Thai digit strings, word-to-number, baht text. |
| sentence | Sentence boundary detection — split a paragraph into sentence spans. |
| soundex | Phonetic codes: lk82, udom83, MetaSound, and cross-language Thai–English. |
Tokenizer
High-level tokenizer backed by a compressed DAWG dictionary and a TNC frequency table. It uses the newmm (maximal matching) algorithm with TCC boundaries as candidate split points. The text field of every token is a zero-copy slice of the input string.
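To make the maximal-matching idea concrete, here is a minimal sketch that picks the segmentation with the fewest tokens over a tiny dictionary. It is not kham-core's implementation: it assumes a plain `HashSet` in place of the DAWG, treats every character boundary (rather than TCC boundaries) as a candidate split point, and ignores word frequencies. The `segment` function below is a hypothetical free function, unrelated to `Tokenizer::segment`.

```rust
use std::collections::HashSet;

/// Split `text` into the fewest tokens, preferring dictionary words.
/// Characters not covered by any word become single-character tokens.
/// Hypothetical sketch; not the kham-core algorithm.
fn segment<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offset of every char boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1; // number of chars

    // best[i] = (token count, next boundary) for the suffix starting at boundary i.
    let mut best: Vec<(usize, usize)> = vec![(0, n); n + 1];
    for i in (0..n).rev() {
        // Fallback: emit one unknown character.
        let mut choice = (best[i + 1].0 + 1, i + 1);
        for j in (i + 1)..=n {
            if dict.contains(&text[bounds[i]..bounds[j]]) {
                let cost = best[j].0 + 1;
                if cost <= choice.0 {
                    choice = (cost, j);
                }
            }
        }
        best[i] = choice;
    }

    // Walk the chosen boundaries and slice the input (zero-copy).
    let mut out = Vec::new();
    let mut i = 0;
    while i < n {
        let j = best[i].1;
        out.push(&text[bounds[i]..bounds[j]]);
        i = j;
    }
    out
}

fn main() {
    let dict: HashSet<&str> = ["กิน", "ข้าว", "กับ", "ปลา"].iter().copied().collect();
    let tokens = segment("กินข้าวกับปลา", &dict);
    println!("{}", tokens.join("|")); // กิน|ข้าว|กับ|ปลา
}
```

Because every returned token is a slice of `text`, no string data is copied, which mirrors the zero-copy property described above.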
```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("กินข้าวกับปลา");
for t in &tokens {
    println!("{} ({:?})", t.text, t.kind);
}
// กิน (Thai)
// ข้าว (Thai)
// กับ (Thai)
// ปลา (Thai)
```
Methods: Tokenizer::new(), Tokenizer::builder(), .segment(&str) → Vec<Token>, .normalize(&str) → String

Token / TokenKind
Every call to segment() returns
a Vec<Token>.
Tokens carry the original text slice plus two span types: byte offsets for Rust string slicing,
and Unicode scalar-value (char) offsets for Python/JS string indexing.
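The difference between the two span types is easiest to see on mixed Thai and ASCII input. The sketch below uses only the standard library; `byte_span_to_char_span` is a hypothetical helper written for this illustration, not a kham-core function.

```rust
use std::ops::Range;

/// Convert a byte span within `input` into a char (Unicode scalar value) span.
/// Assumes both ends of `span` fall on char boundaries; slicing panics otherwise.
fn byte_span_to_char_span(input: &str, span: Range<usize>) -> Range<usize> {
    let start = input[..span.start].chars().count();
    let len = input[span.start..span.end].chars().count();
    start..start + len
}

fn main() {
    let input = "ธนาคาร100แห่ง";
    // "ธนาคาร" is 6 chars of 3 UTF-8 bytes each, so "100" starts at byte 18.
    let byte_span = 18..21;
    assert_eq!(&input[byte_span.clone()], "100");

    let char_span = byte_span_to_char_span(input, byte_span);
    println!("{}..{}", char_span.start, char_span.end); // 6..9
}
```

The byte span is what Rust slicing needs, while the char span (6..9 here) is the index a language that counts code points would use for the same token.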
```rust
use kham_core::{Token, TokenKind, Tokenizer};

let tok = Tokenizer::new();
let input = "ธนาคาร100แห่ง";
let tokens = tok.segment(input);
for t in &tokens {
    // t.text: &str, zero-copy slice of input
    // t.span: Range<usize>, byte offsets (use with &input[t.span.clone()])
    // t.char_span: Range<usize>, Unicode scalar-value offsets
    // t.kind: TokenKind
    assert_eq!(&input[t.span.clone()], t.text); // byte span is exact
    println!("{:8} kind={:12?} chars={}..{}", t.text, t.kind,
             t.char_span.start, t.char_span.end);
}
```

FtsTokenizer
Wraps the segmenter with a configurable NLP pipeline that runs in a single pass: stopword tagging, synonym expansion, POS tagging, NE recognition, RTGS romanization, and phonetic codes. Used internally by the PostgreSQL and SQLite extensions.
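As a rough picture of the stopword stage only, the sketch below tags already-segmented words against a stopword set and then drops the stopwords while keeping each surviving token's original position. Everything here is assumed for illustration: the `FtsTok` struct, the `tag_stopwords` and `indexable` helpers, and the choice of "กับ" as a stopword are hypothetical and do not describe kham-core's internals.

```rust
use std::collections::HashSet;

/// Hypothetical token record carrying a subset of FTS metadata.
#[derive(Debug, PartialEq)]
struct FtsTok<'a> {
    text: &'a str,
    position: usize,
    is_stop: bool,
}

/// Tag each word with its position and whether it is a stopword.
fn tag_stopwords<'a>(words: &[&'a str], stopwords: &HashSet<&str>) -> Vec<FtsTok<'a>> {
    words
        .iter()
        .enumerate()
        .map(|(position, &text)| FtsTok {
            text,
            position,
            is_stop: stopwords.contains(text),
        })
        .collect()
}

/// Keep only non-stopword tokens; positions are left untouched.
fn indexable<'a>(tokens: Vec<FtsTok<'a>>) -> Vec<FtsTok<'a>> {
    tokens.into_iter().filter(|t| !t.is_stop).collect()
}

fn main() {
    // Assumed stopword list, for illustration only.
    let stopwords: HashSet<&str> = ["กับ"].iter().copied().collect();
    let tagged = tag_stopwords(&["กิน", "ข้าว", "กับ", "ปลา"], &stopwords);
    for t in indexable(tagged) {
        println!("{}@{}", t.text, t.position);
    }
    // กิน@0
    // ข้าว@1
    // ปลา@3
}
```

Keeping the original positions means a phrase query can still tell that "ข้าว" and "ปลา" were separated by one word in the source text.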
```rust
use kham_core::fts::FtsTokenizer;

let fts = FtsTokenizer::new();

// All tokens with full metadata
let tokens = fts.segment_for_fts("กินข้าวกับปลา");
for t in &tokens {
    println!("{:8} stop={} pos={:?} ne={:?}", t.text, t.is_stop, t.pos, t.ne);
}

// Only indexable tokens (stopwords removed, positions preserved)
let indexed = fts.index_tokens("กินข้าวกับปลา");

// Flat list of lexeme strings ready for tsvector
let lexemes = fts.lexemes("กินข้าวกับปลา");
```
Token fields: text, position, kind, is_stop, synonyms, trigrams, pos, ne

PosTagger
Dictionary-lookup POS tagger using a 13-category tagset derived from the ORCHID corpus. The built-in table covers the full kham dictionary. Custom tables can be loaded from TSV.
| Tag | Category | Examples |
|---|---|---|
| NOUN | Noun | คน บ้าน ปลา |
| VERB | Verb | กิน ทำ ไป |
| ADJ | Adjective | ดี ใหญ่ สวย |
| ADV | Adverb | มาก เร็ว เสมอ |
| PART | Particle | ครับ ค่ะ นะ |
| PROPN | Proper noun | กรุงเทพ ไทย |
| PRON | Pronoun | ฉัน เขา เรา |
| NUM | Numeral | หนึ่ง สิบ ร้อย |
| CLAS | Classifier | ตัว ใบ อัน |
| CONJ | Conjunction | และ หรือ แต่ |
| AUX | Auxiliary | ได้ ต้อง กำลัง |
| DET | Determiner | นี้ นั้น ทุก |
| PREP | Preposition | ใน บน ตาม |
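A dictionary-lookup tagger over a TSV table can be sketched with a `HashMap`. This is only an illustration of the `word TAB POS_TAG` format under stated assumptions: the `Tag` enum covers three of the 13 categories, malformed lines and unknown tags are skipped silently, and `parse_tag` and `parse_tsv` are hypothetical helpers, not kham-core functions.

```rust
use std::collections::HashMap;

/// Subset of the tagset, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag {
    Noun,
    Verb,
    Adj,
}

/// Map a tag label from the TSV file to a `Tag`; unknown labels yield `None`.
fn parse_tag(label: &str) -> Option<Tag> {
    match label {
        "NOUN" => Some(Tag::Noun),
        "VERB" => Some(Tag::Verb),
        "ADJ" => Some(Tag::Adj),
        _ => None,
    }
}

/// Parse `word TAB TAG` lines into a lookup table, skipping lines that
/// have no tab or carry an unrecognised tag.
fn parse_tsv(tsv: &str) -> HashMap<&str, Tag> {
    tsv.lines()
        .filter_map(|line| {
            let (word, label) = line.split_once('\t')?;
            Some((word.trim(), parse_tag(label.trim())?))
        })
        .collect()
}

fn main() {
    let table = parse_tsv("GPT\tNOUN\nแชทบอท\tNOUN\nกิน\tVERB\n");
    println!("{:?}", table.get("กิน")); // Some(Verb)
    println!("{:?}", table.get("ปลา")); // None
}
```

Tagging a word is then a single hash lookup, which is why a table-based tagger returns an `Option`: a word absent from the table simply has no tag.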
```rust
use kham_core::pos::PosTagger;

let tagger = PosTagger::builtin();

// Tag a single word
if let Some(pos) = tagger.tag("กิน") {
    println!("{:?}", pos); // Verb
}

// Custom tagger from TSV: word TAB POS_TAG
let custom = PosTagger::from_tsv("GPT\tNOUN\nแชทบอท\tNOUN\n");
assert_eq!(custom.tag("แชทบอท"), Some(kham_core::pos::PosTag::Noun));
```

NeTagger
Gazetteer-based named entity recogniser. Relabels tokens whose text appears in the NE table
from TokenKind::Thai to
TokenKind::Named(kind).
Three categories: Person, Place, Org.
```rust
use kham_core::ne::NeTagger;
use kham_core::TokenKind;
use kham_core::Tokenizer;

// Tag a word directly
let ne = NeTagger::builtin();
println!("{:?}", ne.tag("กรุงเทพ")); // Some(Place)

// Post-process a token Vec from Tokenizer::segment
let tok = Tokenizer::new();
let input = "บริษัทไทยออยล์ก่อตั้งในกรุงเทพ";
let tokens = tok.segment(input);
let tagged = ne.tag_tokens(tokens, input);
for t in &tagged {
    if matches!(t.kind, TokenKind::Named(_)) {
        println!("{} → {:?}", t.text, t.kind);
    }
}

// Custom gazetteer from TSV: word TAB NE_TAG (PERSON | PLACE | ORG)
let custom = NeTagger::from_tsv("แอนโทรปิก\tORG\n");
```

RomanizationMap
RTGS (Royal Thai General System of Transcription) romanization lookup.
The built-in table covers common words and proper nouns.
Use romanize_or_raw() when you
always want a string back: it falls back to the original Thai text when the word is not in the table.
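The fallback behaviour amounts to a table lookup with a default. The sketch below shows that pattern with a plain `HashMap`; `lookup_or_raw` and the one-entry table are hypothetical stand-ins, not the library's data or signature.

```rust
use std::collections::HashMap;

/// Return the romanization if `word` is in `table`, otherwise `word` itself.
/// Hypothetical helper illustrating the lookup-with-fallback pattern.
fn lookup_or_raw<'a>(table: &HashMap<&str, &'a str>, word: &'a str) -> &'a str {
    table.get(word).copied().unwrap_or(word)
}

fn main() {
    // Assumed one-entry table, for illustration only.
    let mut table: HashMap<&str, &str> = HashMap::new();
    table.insert("ปลา", "pla");

    println!("{}", lookup_or_raw(&table, "ปลา")); // pla
    println!("{}", lookup_or_raw(&table, "แชทบอท")); // แชทบอท
}
```

Callers that need to distinguish a hit from a miss should use the `Option`-returning lookup instead, since the fallback form hides that information.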
```rust
use kham_core::romanizer::RomanizationMap;

let rom = RomanizationMap::builtin();

// Single word lookup
println!("{:?}", rom.romanize("กรุงเทพ")); // Some("Krung Thep")
println!("{}", rom.romanize_or_raw("ปลา")); // "pla" (or "ปลา" if not in table)

// Batch: takes &[&str], returns Vec<String>
let words = ["กรุงเทพ", "ประเทศ", "ไทย"];
let roman = rom.romanize_tokens(&words);
println!("{:?}", roman); // ["Krung Thep", "prathet", "Thai"]
```

number
Thai numeral utilities: digit conversion, Thai number word parsing, and Thai baht text generation.
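Digit conversion relies on the fact that the Thai digits ๐ to ๙ occupy ten consecutive code points (U+0E50 to U+0E59), so each maps to an ASCII digit by offset. The functions below are standalone sketches with a `_sketch` suffix to keep them distinct from the library's own; treating the empty string as not a digit string is an assumption of this sketch.

```rust
/// Replace Thai digits with ASCII digits, leaving all other characters as-is.
fn thai_digits_to_ascii_sketch(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            // U+0E50..=U+0E59 are contiguous, so the offset gives the digit value.
            '๐'..='๙' => char::from(b'0' + (c as u32 - '๐' as u32) as u8),
            _ => c,
        })
        .collect()
}

/// True if `s` is non-empty and consists only of Thai digits.
fn is_thai_digit_str_sketch(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| ('๐'..='๙').contains(&c))
}

fn main() {
    println!("{}", thai_digits_to_ascii_sketch("ราคา ๑๒๓ บาท")); // ราคา 123 บาท
    println!("{}", is_thai_digit_str_sketch("๙๙๙")); // true
    println!("{}", is_thai_digit_str_sketch("๙9")); // false
}
```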
```rust
use kham_core::number::{thai_digits_to_ascii, is_thai_digit_str};

// ๑๒๓ → "123"
let ascii = thai_digits_to_ascii("ราคา ๑๒๓ บาท");
println!("{ascii}"); // "ราคา 123 บาท"
println!("{}", is_thai_digit_str("๙๙๙")); // true
```

sentence
Sentence boundary detection. Thai text does not reliably mark sentence ends with punctuation,
so the segmenter relies on whitespace, newlines, and Thai punctuation patterns.
Each Sentence is a zero-copy span of the input.
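The zero-copy idea can be shown with a deliberately simplified splitter that breaks on newlines only and records the byte range of each piece. It does not implement the whitespace or punctuation heuristics mentioned above, and the `Span` struct and `split_on_newlines` function are hypothetical names for this sketch.

```rust
use std::ops::Range;

/// A borrowed slice of the input plus its byte range (hypothetical type).
#[derive(Debug, PartialEq)]
struct Span<'a> {
    text: &'a str,
    bytes: Range<usize>,
}

/// Split on '\n' only, trimming surrounding whitespace and skipping empty lines.
fn split_on_newlines<'a>(input: &'a str) -> Vec<Span<'a>> {
    let mut out = Vec::new();
    let mut offset = 0; // byte offset of the current line within `input`
    for piece in input.split('\n') {
        let trimmed = piece.trim();
        if !trimmed.is_empty() {
            // Bytes of leading whitespace removed by trim().
            let lead = piece.len() - piece.trim_start().len();
            let start = offset + lead;
            out.push(Span {
                text: trimmed,
                bytes: start..start + trimmed.len(),
            });
        }
        offset += piece.len() + 1; // +1 skips the '\n' separator
    }
    out
}

fn main() {
    let text = "กินข้าวกับปลา วันนี้อากาศดี\nพรุ่งนี้จะไปเที่ยว";
    for (i, s) in split_on_newlines(text).iter().enumerate() {
        // Each span points back into `text`; nothing is copied.
        assert_eq!(&text[s.bytes.clone()], s.text);
        println!("S{}: {}", i, s.text);
    }
}
```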
```rust
use kham_core::sentence::{split_sentences, SentenceSegmenter};

let text = "กินข้าวกับปลา วันนี้อากาศดี\nพรุ่งนี้จะไปเที่ยว";
let sentences = split_sentences(text);
for (i, s) in sentences.iter().enumerate() {
    println!("S{}: {:?}", i, s.text);
}
// S0: "กินข้าวกับปลา วันนี้อากาศดี"
// S1: "พรุ่งนี้จะไปเที่ยว"

// Or use SentenceSegmenter directly for reuse
let seg = SentenceSegmenter::new();
let sentences = seg.split(text);
```

soundex
Thai phonetic encoding with three algorithms: lk82 (Royal Institute, 1982), udom83 (Udom, 1983), and MetaSound (multi-level encoding). Cross-language matching for Thai–English name search is also supported.
```rust
use kham_core::soundex::{lk82, udom83, metasound, sounds_like, SoundexAlgorithm};

println!("{}", lk82("กรุงเทพ")); // "ก640"
println!("{}", udom83("กรุงเทพ")); // "ก5050"
println!("{}", metasound("กรุงเทพ")); // "ก300"

// Check phonetic similarity
println!("{}", sounds_like("กรุงเทพ", "กรุงแทบ", SoundexAlgorithm::Lk82)); // true
```