API Reference

v0.5.0

Public API of kham-core — the pure Rust, no_std core library. Full rustdoc is available on docs.rs/kham-core.

| Module / Type | Description |
| --- | --- |
| Tokenizer | Core Thai word segmenter: maximal matching over a built-in DAWG dictionary. |
| Token / TokenKind | Zero-copy token slice with byte spans, char spans, and script category. |
| FtsTokenizer | Full-text search pipeline: stopwords, synonyms, POS, NE, soundex in one call. |
| PosTagger | 13-category POS tagger derived from the ORCHID tagset. |
| NeTagger | Named entity recognition: Person, Place, Org via built-in gazetteer. |
| RomanizationMap | RTGS romanization lookup: Thai word → Latin transliteration. |
| number | Thai numeral utilities: parse Thai digit strings, word-to-number, baht text. |
| sentence | Sentence boundary detection: split a paragraph into sentence spans. |
| soundex | Phonetic codes: lk82, udom83, MetaSound, and cross-language Thai–English. |

Tokenizer

High-level tokenizer backed by a compressed DAWG dictionary and TNC frequency table. Uses the newmm (maximal matching) algorithm with TCC boundaries as candidate split points. All token text fields are zero-copy slices of the input string.

use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("กินข้าวกับปลา");

for t in &tokens {
    println!("{} ({:?})", t.text, t.kind);
}
// กิน  (Thai)
// ข้าว (Thai)
// กับ  (Thai)
// ปลา  (Thai)

Key methods: Tokenizer::new(), Tokenizer::builder(), .segment(&str) → Vec<Token>, .normalize(&str) → String
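
To make the dictionary-matching idea concrete, here is a minimal sketch of greedy longest-match segmentation over a toy word list. It is not kham-core's implementation: the real tokenizer uses a DAWG, a frequency table, and TCC boundaries, none of which are modeled here. The function name `greedy_segment` and the slice-based dictionary are assumptions made for illustration. Like the library, the sketch returns zero-copy slices of the input.

```rust
/// Greedy longest-match segmentation over a toy dictionary (hypothetical helper,
/// a simpler relative of the maximal matching described above).
fn greedy_segment<'a>(input: &'a str, dict: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        // Byte length of the longest dictionary word that prefixes the remaining text.
        let best = dict
            .iter()
            .filter(|w| !w.is_empty() && rest.starts_with(**w))
            .map(|w| w.len())
            .max();
        // Nothing matched: emit a single character so the loop always advances.
        let len = best.unwrap_or_else(|| rest.chars().next().map_or(1, |c| c.len_utf8()));
        // Slice the original input, so no text is copied.
        out.push(&input[pos..pos + len]);
        pos += len;
    }
    out
}
```

Calling `greedy_segment("กินข้าวกับปลา", &["กิน", "ข้าว", "กับ", "ปลา"])` yields the same four words as the example above. Greedy matching can pick a poor split on ambiguous text, which is one reason production segmenters score whole candidate paths instead.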

Token / TokenKind

Every call to segment() returns a Vec<Token>. Tokens carry the original text slice plus two span types: byte offsets for Rust string slicing, and Unicode scalar-value (char) offsets for Python/JS string indexing.

use kham_core::{Token, TokenKind, Tokenizer};

let tok = Tokenizer::new();
let input = "ธนาคาร100แห่ง";
let tokens = tok.segment(input);

for t in &tokens {
    // t.text      — &str, zero-copy slice of input
    // t.span      — Range<usize>, byte offsets (use with &input[t.span.clone()])
    // t.char_span — Range<usize>, Unicode scalar-value offsets
    // t.kind      — TokenKind

    assert_eq!(&input[t.span.clone()], t.text); // byte span is exact
    println!("{:8} kind={:12?}  chars={}..{}", t.text, t.kind,
        t.char_span.start, t.char_span.end);
}
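
The relationship between the two span types can be sketched with the standard library alone. The helper below is hypothetical (kham-core already supplies `char_span` on each token); it only shows how a byte range maps to a Unicode scalar-value range, which is useful when checking offsets by hand.

```rust
use std::ops::Range;

/// Convert a byte range of `input` into a range of Unicode scalar-value offsets.
/// Hypothetical helper: panics if the byte range does not fall on char boundaries.
fn byte_to_char_span(input: &str, bytes: Range<usize>) -> Range<usize> {
    // Characters before the span give the starting char offset.
    let start = input[..bytes.start].chars().count();
    // Characters inside the span give its length in chars.
    let len = input[bytes.start..bytes.end].chars().count();
    start..start + len
}
```

In "ธนาคาร100แห่ง" each Thai character occupies three bytes in UTF-8, so the digits "100" sit at bytes 18..21 but at chars 6..9.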

FtsTokenizer

Wraps the segmenter with a configurable NLP pipeline in a single pass: stopword tagging, synonym expansion, POS tagging, NE recognition, RTGS romanization, and phonetic codes. Used internally by the PostgreSQL and SQLite extensions.

use kham_core::fts::FtsTokenizer;

let fts = FtsTokenizer::new();

// All tokens with full metadata
let tokens = fts.segment_for_fts("กินข้าวกับปลา");
for t in &tokens {
    println!("{:8} stop={} pos={:?} ne={:?}", t.text, t.is_stop, t.pos, t.ne);
}

// Only indexable tokens (stopwords removed, positions preserved)
let indexed = fts.index_tokens("กินข้าวกับปลา");

// Flat list of lexeme strings ready for tsvector
let lexemes = fts.lexemes("กินข้าวกับปลา");

FtsToken fields: text, position, kind, is_stop, synonyms, trigrams, pos, ne
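
"Stopwords removed, positions preserved" means the surviving tokens keep their original ordinal, which phrase and proximity queries depend on. A minimal sketch of that idea follows, assuming tokens are plain string slices and the stopword list is a `HashSet`; the name `keep_indexable` is made up for this example and is not part of the crate.

```rust
use std::collections::HashSet;

/// Drop stopwords but keep each surviving token's original position.
/// Hypothetical helper illustrating the idea, not kham-core's code.
fn keep_indexable<'a>(tokens: &[&'a str], stopwords: &HashSet<&str>) -> Vec<(usize, &'a str)> {
    tokens
        .iter()
        .enumerate() // position is assigned before filtering
        .filter(|(_, t)| !stopwords.contains(**t))
        .map(|(i, t)| (i, *t))
        .collect()
}
```

With "กับ" treated as a stopword, the tokens of "กินข้าวกับปลา" come back at positions 0, 1, and 3, leaving a gap at 2.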

PosTagger

Dictionary-lookup POS tagger using a 13-category tagset derived from the ORCHID corpus. The built-in table covers the full kham dictionary. Custom tables can be loaded from TSV.

| Tag | Category | Examples |
| --- | --- | --- |
| NOUN | Noun | คน บ้าน ปลา |
| VERB | Verb | กิน ทำ ไป |
| ADJ | Adjective | ดี ใหญ่ สวย |
| ADV | Adverb | มาก เร็ว เสมอ |
| PART | Particle | ครับ ค่ะ นะ |
| PROPN | Proper noun | กรุงเทพ ไทย |
| PRON | Pronoun | ฉัน เขา เรา |
| NUM | Numeral | หนึ่ง สิบ ร้อย |
| CLAS | Classifier | ตัว ใบ อัน |
| CONJ | Conjunction | และ หรือ แต่ |
| AUX | Auxiliary | ได้ ต้อง กำลัง |
| DET | Determiner | นี้ นั้น ทุก |
| PREP | Preposition | ใน บน ตาม |

use kham_core::pos::PosTagger;

let tagger = PosTagger::builtin();

// Tag a single word
if let Some(pos) = tagger.tag("กิน") {
    println!("{:?}", pos); // Verb
}

// Custom tagger from TSV: word TAB POS_TAG
let custom = PosTagger::from_tsv("GPT\tNOUN\nแชทบอท\tNOUN\n");
assert_eq!(custom.tag("แชทบอท"), Some(kham_core::pos::PosTag::Noun));
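
For readers preparing their own table, the `word TAB POS_TAG` format can be parsed in a few lines. This is only a sketch of the file format using std types: the three-variant `Tag` enum, `parse_tag`, and `parse_tsv` are hypothetical names, and how the library itself treats malformed lines or unknown tags is not stated here. The sketch simply skips them.

```rust
use std::collections::HashMap;

/// Toy subset of a POS tagset (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tag {
    Noun,
    Verb,
    Adj,
}

fn parse_tag(s: &str) -> Option<Tag> {
    match s {
        "NOUN" => Some(Tag::Noun),
        "VERB" => Some(Tag::Verb),
        "ADJ" => Some(Tag::Adj),
        _ => None,
    }
}

/// Parse `word TAB TAG` lines; lines without a tab or with an unknown tag are skipped.
fn parse_tsv(tsv: &str) -> HashMap<&str, Tag> {
    tsv.lines()
        .filter_map(|line| {
            let (word, tag) = line.split_once('\t')?;
            Some((word.trim(), parse_tag(tag.trim())?))
        })
        .collect()
}
```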

NeTagger

Gazetteer-based named entity recogniser. Relabels tokens whose text appears in the NE table from TokenKind::Thai to TokenKind::Named(kind). Three categories: Person, Place, Org.

use kham_core::ne::NeTagger;
use kham_core::TokenKind;
use kham_core::Tokenizer;

// Tag a word directly
let ne = NeTagger::builtin();
println!("{:?}", ne.tag("กรุงเทพ")); // Some(Place)

// Post-process a token Vec from Tokenizer::segment
let tok = Tokenizer::new();
let tokens = tok.segment("บริษัทไทยออยล์ก่อตั้งในกรุงเทพ");
let tagged  = ne.tag_tokens(tokens, "บริษัทไทยออยล์ก่อตั้งในกรุงเทพ");

for t in &tagged {
    if matches!(t.kind, TokenKind::Named(_)) {
        println!("{} → {:?}", t.text, t.kind);
    }
}

// Custom gazetteer from TSV: word TAB NE_TAG  (PERSON | PLACE | ORG)
let custom = NeTagger::from_tsv("แอนโทรปิก\tORG\n");
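
The relabeling step described above amounts to a table lookup per token. The sketch below models it with stand-in types: `NeKind`, `Kind`, and `relabel` are hypothetical and much simpler than the crate's `TokenKind` and `Token`, so treat it as an illustration of the control flow only.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NeKind {
    Person,
    Place,
    Org,
}

/// Stand-in for a token category (hypothetical, not the crate's TokenKind).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Thai,
    Named(NeKind),
}

/// Tokens found in the gazetteer become `Named`; everything else stays `Thai`.
fn relabel<'a>(tokens: &[&'a str], gazetteer: &HashMap<&str, NeKind>) -> Vec<(&'a str, Kind)> {
    tokens
        .iter()
        .map(|&t| {
            let kind = gazetteer.get(t).map_or(Kind::Thai, |&ne| Kind::Named(ne));
            (t, kind)
        })
        .collect()
}
```

Because the lookup is by exact token text, an entity is only recognised when the segmenter emits it as a single token.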

RomanizationMap

RTGS (Royal Thai General System of Transcription) romanization lookup. The built-in table covers common words and proper nouns. Use romanize_or_raw() when you always want a string back: it falls back to the original Thai text when the word is not in the table.

use kham_core::romanizer::RomanizationMap;

let rom = RomanizationMap::builtin();

// Single word lookup
println!("{:?}", rom.romanize("กรุงเทพ"));   // Some("Krung Thep")
println!("{}", rom.romanize_or_raw("ปลา"));  // "pla" (or "ปลา" if not in table)

// Batch: takes &[&str], returns Vec<String>
let words  = ["กรุงเทพ", "ประเทศ", "ไทย"];
let roman  = rom.romanize_tokens(&words);
println!("{:?}", roman); // ["Krung Thep", "prathet", "Thai"]
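
The fallback behaviour of romanize_or_raw() can be pictured as an `Option` lookup with a default. A minimal sketch, assuming the table is a plain `HashMap` of string slices; `lookup_or_raw` is a hypothetical name and says nothing about how the crate stores its table.

```rust
use std::collections::HashMap;

/// Return the romanization if the word is in the table, otherwise the word itself.
/// Hypothetical helper mirroring the described fallback.
fn lookup_or_raw<'a>(table: &HashMap<&str, &'a str>, word: &'a str) -> &'a str {
    table.get(word).copied().unwrap_or(word)
}
```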

number

Thai numeral utilities: digit conversion, Thai number word parsing, Thai baht text generation.

use kham_core::number::{thai_digits_to_ascii, is_thai_digit_str};

// ๑๒๓ → "123"
let ascii = thai_digits_to_ascii("ราคา ๑๒๓ บาท");
println!("{ascii}"); // "ราคา 123 บาท"

println!("{}", is_thai_digit_str("๙๙๙")); // true
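
Digit conversion works because the Thai digits ๐ to ๙ occupy the contiguous code points U+0E50 to U+0E59. The sketch below re-creates the two operations with std only, under different names (`digits_to_ascii`, `all_thai_digits`) to make clear it is not the crate's code. Treating the empty string as "not a digit string" is an assumption of this sketch.

```rust
const THAI_ZERO: char = '\u{0E50}'; // ๐
const THAI_NINE: char = '\u{0E59}'; // ๙

/// Replace Thai digits with ASCII digits, leaving every other character untouched.
fn digits_to_ascii(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            // Offset from ๐ equals the digit's numeric value.
            THAI_ZERO..=THAI_NINE => char::from(b'0' + (c as u32 - THAI_ZERO as u32) as u8),
            _ => c,
        })
        .collect()
}

/// True when the string is non-empty and consists only of Thai digits.
fn all_thai_digits(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| (THAI_ZERO..=THAI_NINE).contains(&c))
}
```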

sentence

Sentence boundary detection. Thai is normally written without sentence-final punctuation, so the segmenter relies on whitespace, newlines, and Thai punctuation patterns as boundary cues. Each Sentence is a zero-copy span of the input.

use kham_core::sentence::split_sentences;

let text = "กินข้าวกับปลา วันนี้อากาศดี\nพรุ่งนี้จะไปเที่ยว";
let sentences = split_sentences(text);

for (i, s) in sentences.iter().enumerate() {
    println!("S{}: {:?}", i, s.text);
}
// S0: "กินข้าวกับปลา วันนี้อากาศดี"
// S1: "พรุ่งนี้จะไปเที่ยว"

// Or use SentenceSegmenter directly for reuse
use kham_core::sentence::SentenceSegmenter;
let seg = SentenceSegmenter::new();
let sentences = seg.split(text);
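
"Zero-copy span" means each sentence is a byte range into the original text rather than a new string. The sketch below shows that bookkeeping for the simplest cue only, the newline. It deliberately ignores the whitespace and punctuation heuristics mentioned above, and `split_lines_with_spans` is a hypothetical name, not a crate function.

```rust
use std::ops::Range;

/// Split on '\n', returning each non-empty piece with its byte span in `text`.
/// Hypothetical helper covering only the newline cue.
fn split_lines_with_spans(text: &str) -> Vec<(Range<usize>, &str)> {
    let mut out = Vec::new();
    let mut offset = 0; // byte offset of the current piece within `text`
    for piece in text.split('\n') {
        let trimmed = piece.trim();
        if !trimmed.is_empty() {
            // Leading whitespace shifts the start of the trimmed slice.
            let lead = piece.len() - piece.trim_start().len();
            let start = offset + lead;
            out.push((start..start + trimmed.len(), trimmed));
        }
        offset += piece.len() + 1; // +1 for the '\n' consumed by split
    }
    out
}
```

For every returned pair, slicing `text` with the span gives back exactly the sentence text, the same invariant the Token byte spans satisfy.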

soundex

Thai phonetic encoding with three algorithms: lk82 (Lorchirachoonkul, 1982), udom83 (Udompanich, 1983), and MetaSound (multi-level encoding). Cross-language matching for Thai–English name search is also supported.

use kham_core::soundex::{lk82, udom83, metasound, sounds_like, SoundexAlgorithm};

println!("{}", lk82("กรุงเทพ"));     // "ก640"
println!("{}", udom83("กรุงเทพ"));   // "ก5050"
println!("{}", metasound("กรุงเทพ")); // "ก300"

// Check phonetic similarity
println!("{}", sounds_like("กรุงเทพ", "กรุงแทบ", SoundexAlgorithm::Lk82)); // true