🐍
Python
kham's Python bindings are built with PyO3 and expose the full segmentation API, including rich Token objects that carry kind, part-of-speech (POS), named-entity (NE), and span information.
Install
pip install kham
Basic segmentation
import kham
# Returns a list of strings
words = kham.segment("กินข้าวกับปลา")
print(words)
# ['กิน', 'ข้าว', 'กับ', 'ปลา']
# Mixed-script input is handled correctly
mixed = kham.segment("Hello กรุงเทพ 2024!")
print(mixed)
# ['Hello', ' ', 'กรุงเทพ', ' ', '2024', '!']
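Because whitespace and punctuation come back as separate items, a downstream step often filters them out. The sketch below works on the list printed above rather than calling kham, so it runs without the package installed; the helper name content_words is hypothetical and not part of kham.

```python
# Output of kham.segment("Hello กรุงเทพ 2024!") as documented above,
# hard-coded so the sketch runs without kham installed.
mixed = ["Hello", " ", "กรุงเทพ", " ", "2024", "!"]


def content_words(words):
    """Hypothetical helper: keep items containing at least one letter or digit."""
    return [w for w in words if any(ch.isalnum() for ch in w)]


print(content_words(mixed))
# ['Hello', 'กรุงเทพ', '2024']

# For this example the pieces concatenate back to the input string.
print("".join(mixed) == "Hello กรุงเทพ 2024!")
# True
```

The round-trip check holds for this documented example; whether it holds for every input is an assumption worth testing against your own data.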
Rich Token objects
segment_tokens() returns Token objects with text, kind, and Unicode char spans — ideal for NLP pipelines.
import kham
tokens = kham.segment_tokens("กินข้าวกับปลา")
for tok in tokens:
    print(f"{tok.text!r:10} kind={tok.kind:12} chars={tok.char_start}..{tok.char_end}")
# 'กิน' kind=Thai chars=0..3
# 'ข้าว' kind=Thai chars=3..7
# 'กับ' kind=Thai chars=7..10
# 'ปลา' kind=Thai chars=10..13
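The char spans index directly into the original Python string. A minimal sketch, using a stand-in Tok tuple filled with the values printed above instead of real kham Token objects (Tok and mark_boundaries are hypothetical names, not part of kham):

```python
from collections import namedtuple

# Stand-in for kham's Token, holding only the fields used here.
Tok = namedtuple("Tok", ["text", "kind", "char_start", "char_end"])

text = "กินข้าวกับปลา"
# Values copied from the documented segment_tokens() output above.
tokens = [
    Tok("กิน", "Thai", 0, 3),
    Tok("ข้าว", "Thai", 3, 7),
    Tok("กับ", "Thai", 7, 10),
    Tok("ปลา", "Thai", 10, 13),
]


def mark_boundaries(source, toks, sep="|"):
    """Slice each token out of source by its char span and join with sep."""
    return sep.join(source[t.char_start:t.char_end] for t in toks)


# Each span slices out exactly the token's text (end offset is exclusive).
assert all(text[t.char_start:t.char_end] == t.text for t in tokens)

print(mark_boundaries(text, tokens))
# กิน|ข้าว|กับ|ปลา
```

With real Token objects the same slicing should apply unchanged, since only the text, char_start, and char_end attributes are used.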
Token fields
Every Token object exposes the following attributes:
tok.text # str — the token text
tok.kind # str — "Thai" | "Latin" | "Number" | "Punctuation" | "Emoji" | "Whitespace" | "Unknown"
tok.char_start # int — Unicode scalar-value start offset (use with str slicing, e.g. text[tok.char_start:tok.char_end])
tok.char_end # int — Unicode scalar-value end offset
tok.byte_start # int — UTF-8 byte start offset
tok.byte_end # int — UTF-8 byte end offset
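Char and byte offsets differ for non-ASCII text: each Thai character in the example string occupies three bytes in UTF-8. The sketch below derives a byte span from a char span in plain Python to show how the two pairs of fields relate. char_span_to_byte_span is a hypothetical helper, not a kham function, and the end-exclusive convention is inferred from the example output earlier on this page.

```python
def char_span_to_byte_span(text, char_start, char_end):
    """Hypothetical helper: map a char span to the matching UTF-8 byte span."""
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(text[char_start:char_end].encode("utf-8"))
    return byte_start, byte_end


text = "กินข้าวกับปลา"
data = text.encode("utf-8")

# Char span 3..7 is 'ข้าว' in the documented example.
byte_start, byte_end = char_span_to_byte_span(text, 3, 7)
print(byte_start, byte_end)
# 9 21

# Char offsets slice the str; byte offsets slice the encoded bytes.
print(text[3:7] == data[byte_start:byte_end].decode("utf-8"))
# True
```

In practice the char offsets suit Python string handling, while the byte offsets are the natural choice when indexing into UTF-8 buffers, for example when passing positions to Rust code or to tools that work on raw bytes.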
Build from source (optional)
# Requires Rust toolchain + maturin
pip install maturin
git clone https://github.com/preedep/kham
cd kham
maturin develop -m kham-python/Cargo.toml