🐍
Python
kham's Python bindings are built with PyO3 and expose the full segmentation API, including rich Token objects that carry kind, part-of-speech (POS), named-entity (NE), and span information.
Install
pip install kham
Basic segmentation
import kham
# Returns a list of strings
words = kham.segment("กินข้าวกับปลา")
print(words)
# ['กิน', 'ข้าว', 'กับ', 'ปลา']
# Mixed-script input is handled correctly
mixed = kham.segment("Hello กรุงเทพ 2024!")
print(mixed)
# ['Hello', ' ', 'กรุงเทพ', ' ', '2024', '!']
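The sample output above keeps whitespace as separate list items. A minimal sketch of what that implies, using only the list shown above rather than a live kham call: joining the items gives back the input (at least for this sample), and a small filter drops the whitespace entries when they are not wanted. The helper name `drop_whitespace` is hypothetical and not part of kham.

```python
def drop_whitespace(words):
    """Return the items of a segment()-style list that are not pure whitespace."""
    return [w for w in words if w.strip()]

# Output of kham.segment("Hello กรุงเทพ 2024!") copied from the example above.
original = "Hello กรุงเทพ 2024!"
mixed = ['Hello', ' ', 'กรุงเทพ', ' ', '2024', '!']

# For this sample, concatenating the items reproduces the input exactly.
print("".join(mixed) == original)
# True

print(drop_whitespace(mixed))
# ['Hello', 'กรุงเทพ', '2024', '!']
```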
Rich Token objects
segment_tokens() returns Token objects that carry the token text, its kind, and Unicode character spans, which makes them convenient for NLP pipelines.
import kham
tokens = kham.segment_tokens("กินข้าวกับปลา")
for tok in tokens:
    print(f"{tok.text!r:10} kind={tok.kind:12} chars={tok.char_start}..{tok.char_end}")
# 'กิน' kind=Thai chars=0..3
# 'ข้าว' kind=Thai chars=3..7
# 'กับ' kind=Thai chars=7..10
# 'ปลา' kind=Thai chars=10..13
Token fields
Every Token object exposes the following attributes:
tok.text        # str: the token text
tok.kind        # str: "Thai" | "Latin" | "Number" | "Punctuation" | "Emoji" | "Whitespace" | "Unknown"
tok.char_start  # int: Unicode scalar-value start offset (use with Python slicing: text[tok.char_start:tok.char_end])
tok.char_end    # int: Unicode scalar-value end offset
tok.byte_start  # int: UTF-8 byte start offset
tok.byte_end    # int: UTF-8 byte end offset
tok.confidence  # float: 0.0 (Unknown) … 1.0 (high-confidence dict match)
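The difference between the char and byte offsets matters for Thai, because each Thai code point takes three bytes in UTF-8. The sketch below uses plain Python and the span `chars=3..7` from the earlier example output. It does not call kham, and the byte offsets are derived by encoding, not read from a Token.

```python
text = "กินข้าวกับปลา"

# Character span of the second token in the example output: chars=3..7
char_start, char_end = 3, 7
word = text[char_start:char_end]  # Python str slicing works on code points

# Matching UTF-8 byte offsets, derived by encoding the text before each boundary.
byte_start = len(text[:char_start].encode("utf-8"))
byte_end = len(text[:char_end].encode("utf-8"))

# Slicing the encoded bytes with the byte offsets yields the same token.
data = text.encode("utf-8")
assert data[byte_start:byte_end].decode("utf-8") == word

print(word, (char_start, char_end), (byte_start, byte_end))
# ข้าว (3, 7) (9, 21)
```

Char offsets suit Python string slicing. Byte offsets are the ones to use when indexing into the UTF-8 encoded form, for example when handing spans to code that works on raw bytes.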
Spell checking
spell_suggestions() finds dictionary words within edit distance 2 of the input and ranks them by phonetic similarity (lk82) first, then by TNC corpus frequency. Pass a single word; for multi-word text, segment first and check each word.
import kham
# Single misspelled word → ranked suggestions
suggs = kham.spell_suggestions("กีนข้าว", 5)
for s in suggs:
    print(f"{s.word:12} edit={s.edit_distance} soundex={s.soundex_match} freq={s.freq_score}")
# กินข้าว edit=1 soundex=True freq=…
# Correct word → edit_distance 0
top = kham.spell_suggestions("กิน", 1)
print(top[0].word, top[0].edit_distance) # กิน 0
# SpellSuggestion: .word .edit_distance .soundex_match .freq_score
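For multi-word text, the advice to segment first can be sketched as a small helper that checks each token in turn. The helper name `suggest_per_word` is hypothetical. The segmenter and the suggestion function are passed in as arguments so the sketch runs without kham installed, and the two `fake_*` functions are stand-ins with made-up data, not kham output.

```python
from types import SimpleNamespace

def suggest_per_word(text, segment, suggest, limit=3):
    """Segment `text`, then collect suggestions for each non-whitespace token.

    `segment` and `suggest` are injected callables; with kham they would be
    kham.segment and kham.spell_suggestions (assumed to accept the same
    arguments as in the examples above).
    """
    results = {}
    for word in segment(text):
        if not word.strip():  # skip whitespace-only tokens
            continue
        results[word] = [s.word for s in suggest(word, limit)]
    return results

# Stand-ins for illustration only (hypothetical data, not kham output).
def fake_segment(text):
    return text.split("|")

def fake_suggest(word, limit):
    table = {"กีน": ["กิน"], "ข้าว": ["ข้าว"]}
    return [SimpleNamespace(word=w) for w in table.get(word, [])][:limit]

print(suggest_per_word("กีน|ข้าว", fake_segment, fake_suggest))
# {'กีน': ['กิน'], 'ข้าว': ['ข้าว']}
```

With kham installed, the equivalent call would presumably be `suggest_per_word(text, kham.segment, kham.spell_suggestions)`.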
Keyword extraction
extract_keywords() returns the most distinctive words in a document, scored by TF × inverse-corpus-frequency. Stopwords and single-character tokens are excluded.
import kham
text = ("นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ "
        "ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย")
keywords = kham.extract_keywords(text, 5)
for kw in keywords:
    print(f"{kw.word:12} score={kw.score:.4f} count={kw.count}")
# Keyword: .word .score .count
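To give a feel for how a TF × inverse-corpus-frequency score behaves, here is a rough sketch in plain Python. It is not kham's implementation: the exact formula is not stated here, so the sketch assumes `score = tf * -log(corpus_freq)`. The function name `keyword_scores`, the `default_freq` fallback, and the corpus frequencies in the example are all hypothetical.

```python
import math
from collections import Counter

def keyword_scores(tokens, corpus_freq, stopwords=frozenset(), default_freq=1e-6):
    """Rank tokens by term frequency times an assumed inverse-corpus-frequency weight.

    Returns (word, score, count) tuples, highest score first. Stopwords and
    single-character tokens are skipped.
    """
    counts = Counter(t for t in tokens if len(t) > 1 and t not in stopwords)
    total = sum(counts.values())
    scored = []
    for word, count in counts.items():
        tf = count / total
        # Rare words in the reference corpus get a larger weight (assumed form).
        rarity = -math.log(corpus_freq.get(word, default_freq))
        scored.append((word, tf * rarity, count))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical pre-segmented tokens and made-up relative corpus frequencies.
tokens = ["ดาวเคราะห์", "ใน", "ระบบสุริยะ", "ดาวเคราะห์", "ใหม่", "ๆ"]
corpus_freq = {"ดาวเคราะห์": 0.0001, "ระบบสุริยะ": 0.0001, "ใหม่": 0.01}

ranked = keyword_scores(tokens, corpus_freq, stopwords={"ใน"})
print([word for word, _, _ in ranked])
# ['ดาวเคราะห์', 'ระบบสุริยะ', 'ใหม่']
```

A word that appears often in the document but rarely in the reference corpus ends up at the top, which is the behaviour the description of extract_keywords() suggests.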
Build from source (optional)
# Requires Rust toolchain + maturin
pip install maturin
git clone https://github.com/preedep/kham
cd kham
maturin develop -m kham-python/Cargo.toml