Python

kham's Python bindings are built with PyO3 and expose the full segmentation API, including rich Token objects that carry kind, part-of-speech (POS), named-entity (NE), and span information.

1

Install

pip install kham

2

Basic segmentation

import kham

# Returns a list of strings
words = kham.segment("กินข้าวกับปลา")
print(words)
# ['กิน', 'ข้าว', 'กับ', 'ปลา']

# Mixed-script input is handled correctly
mixed = kham.segment("Hello กรุงเทพ 2024!")
print(mixed)
# ['Hello', ' ', 'กรุงเทพ', ' ', '2024', '!']
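
As the mixed-script output above shows, whitespace comes back as its own token. A common follow-up step is to drop those tokens before counting or indexing words. The sketch below is a minimal illustration: `drop_whitespace` is a hypothetical helper, not part of kham, and the sample list is copied from the documented output so the snippet runs without kham installed.

```python
def drop_whitespace(words):
    """Return only the tokens that contain a non-whitespace character."""
    return [w for w in words if w.strip()]


# Sample data copied from the documented output of
# kham.segment("Hello กรุงเทพ 2024!"); replace with a real call if kham is installed.
sample = ['Hello', ' ', 'กรุงเทพ', ' ', '2024', '!']

content_words = drop_whitespace(sample)
print(content_words)
# ['Hello', 'กรุงเทพ', '2024', '!']
```

In this example the original tokens also join back to the input string, so filtering is best done on a copy if the exact input needs to be reconstructed later.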

3

Rich Token objects

segment_tokens() returns Token objects with text, kind, and Unicode char spans — ideal for NLP pipelines.

import kham

tokens = kham.segment_tokens("กินข้าวกับปลา")
for tok in tokens:
    print(f"{tok.text!r:10}  kind={tok.kind:12}  chars={tok.char_start}..{tok.char_end}")

# 'กิน'       kind=Thai          chars=0..3
# 'ข้าว'      kind=Thai          chars=3..7
# 'กับ'       kind=Thai          chars=7..10
# 'ปลา'       kind=Thai          chars=10..13
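
Because the char spans count Unicode code points, they should line up with ordinary Python string slicing. The sketch below checks that against the values printed above. `FakeToken` is a hypothetical stand-in that mirrors only the fields used here, so the snippet runs without kham installed; with the real library, the list returned by `kham.segment_tokens()` would take its place.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for kham's Token (only the fields used here)."""
    text: str
    kind: str
    char_start: int
    char_end: int


text = "กินข้าวกับปลา"

# Values copied from the documented segment_tokens() output above.
tokens = [
    FakeToken("กิน", "Thai", 0, 3),
    FakeToken("ข้าว", "Thai", 3, 7),
    FakeToken("กับ", "Thai", 7, 10),
    FakeToken("ปลา", "Thai", 10, 13),
]


def spans_match(source, toks):
    """True if every token's char span slices back to its own text."""
    return all(source[t.char_start:t.char_end] == t.text for t in toks)


print(spans_match(text, tokens))
# True
```

A check like this is a cheap sanity test when token spans are passed to downstream annotation tools that expect character offsets.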

4

Token fields

Every Token object exposes the following attributes:

tok.text        # str  — the token text
tok.kind        # str  — "Thai" | "Latin" | "Number" | "Punctuation" | "Emoji" | "Whitespace" | "Unknown"
tok.char_start  # int  — Unicode scalar-value start offset (use with Python slicing: text[tok.char_start:tok.char_end])
tok.char_end    # int  — Unicode scalar-value end offset
tok.byte_start  # int  — UTF-8 byte start offset
tok.byte_end    # int  — UTF-8 byte end offset
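
The two offset pairs differ for any non-ASCII input, since each Thai character occupies three bytes in UTF-8. The sketch below shows how a char offset maps to a byte offset using only the standard library. `char_to_byte_offset` is a hypothetical helper, and the assumption that `tok.byte_start` and `tok.byte_end` index the UTF-8 encoding of the whole input string is inferred from the field descriptions above, not confirmed against kham itself.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Convert a code-point offset into the matching UTF-8 byte offset."""
    return len(text[:char_offset].encode("utf-8"))


text = "กินข้าวกับปลา"
raw = text.encode("utf-8")

# Char span of the second token ('ข้าว') in the documented output: 3..7
char_start, char_end = 3, 7
byte_start = char_to_byte_offset(text, char_start)
byte_end = char_to_byte_offset(text, char_end)

print(byte_start, byte_end)
# 9 21

# Byte offsets index the encoded bytes, not the str itself.
print(raw[byte_start:byte_end].decode("utf-8"))
# ข้าว
```

Char offsets are the natural choice when slicing a Python `str`; byte offsets are useful when the same spans are handed to tools that work on the raw UTF-8 buffer.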

5

Build from source (optional)

# Requires Rust toolchain + maturin
pip install maturin
git clone https://github.com/preedep/kham
cd kham
maturin develop -m kham-python/Cargo.toml