
Python

kham's Python bindings are built with PyO3 and expose the full segmentation API — including rich Token objects with kind, POS, NE, and span information.

1

Install

pip install kham
2

Basic segmentation

import kham

# Returns a list of strings
words = kham.segment("กินข้าวกับปลา")
print(words)
# ['กิน', 'ข้าว', 'กับ', 'ปลา']

# Mixed-script input is split by script; whitespace and punctuation become separate items
mixed = kham.segment("Hello กรุงเทพ 2024!")
print(mixed)
# ['Hello', ' ', 'กรุงเทพ', ' ', '2024', '!']
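Because the mixed-script output above keeps whitespace and punctuation as separate items, the list can be post-processed with plain Python. The sketch below works on the literal output shown above rather than calling kham, and the helper name `drop_whitespace` is hypothetical. That `segment()` always preserves every input character is an assumption drawn from this one example, not a documented guarantee.

```python
def drop_whitespace(words):
    """Return the words with whitespace-only items removed (hypothetical helper)."""
    return [w for w in words if not w.isspace()]


# Literal output copied from the mixed-script example above.
mixed = ["Hello", " ", "กรุงเทพ", " ", "2024", "!"]

# In this example, joining the pieces reproduces the input exactly.
assert "".join(mixed) == "Hello กรุงเทพ 2024!"

print(drop_whitespace(mixed))
# ['Hello', 'กรุงเทพ', '2024', '!']
```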
3

Rich Token objects

segment_tokens() returns Token objects that carry the token text, its kind, and its Unicode character span, so each token can be mapped back to its position in the input.

import kham

tokens = kham.segment_tokens("กินข้าวกับปลา")
for tok in tokens:
    print(f"{tok.text!r:10}  kind={tok.kind:12}  chars={tok.char_start}..{tok.char_end}")

# 'กิน'       kind=Thai          chars=0..3
# 'ข้าว'      kind=Thai          chars=3..7
# 'กับ'       kind=Thai          chars=7..10
# 'ปลา'       kind=Thai          chars=10..13
4

Token fields

Every Token object exposes the following attributes:

tok.text        # str   - the token text
tok.kind        # str   - "Thai" | "Latin" | "Number" | "Punctuation" | "Emoji" | "Whitespace" | "Unknown"
tok.char_start  # int   - Unicode scalar-value start offset (use with str slicing: text[tok.char_start:tok.char_end])
tok.char_end    # int   - Unicode scalar-value end offset (exclusive)
tok.byte_start  # int   - UTF-8 byte start offset
tok.byte_end    # int   - UTF-8 byte end offset (exclusive)
tok.confidence  # float - 0.0 (Unknown) to 1.0 (high-confidence dict match)
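The difference between the char and byte offsets matters for Thai, where each character occupies three bytes in UTF-8. The following is a minimal sketch that does not call kham: `Span` is a hypothetical stand-in holding only the offset fields, and `spans_for` is a hypothetical helper that recomputes the offsets from an already segmented word list. It assumes the words tile the input left to right with no gaps.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in holding only the offset fields of a Token."""
    text: str
    char_start: int
    char_end: int
    byte_start: int
    byte_end: int


def spans_for(source, words):
    """Compute char and byte offsets for words that tile `source` in order."""
    spans, char_pos, byte_pos = [], 0, 0
    for word in words:
        if not source.startswith(word, char_pos):
            raise ValueError(f"{word!r} not found at char offset {char_pos}")
        char_end = char_pos + len(word)                  # len() counts code points
        byte_end = byte_pos + len(word.encode("utf-8"))  # encoded length in bytes
        spans.append(Span(word, char_pos, char_end, byte_pos, byte_end))
        char_pos, byte_pos = char_end, byte_end
    return spans


source = "กินข้าวกับปลา"
raw = source.encode("utf-8")

for span in spans_for(source, ["กิน", "ข้าว", "กับ", "ปลา"]):
    # Char offsets index the str; byte offsets index the encoded bytes.
    assert source[span.char_start:span.char_end] == span.text
    assert raw[span.byte_start:span.byte_end].decode("utf-8") == span.text
    print(span.text, (span.char_start, span.char_end), (span.byte_start, span.byte_end))
```

Use the char offsets when slicing a Python `str`, and the byte offsets when working with the encoded `bytes` (for example, when handing positions to a tool that indexes UTF-8 buffers).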
5

Spell checking

spell_suggestions() finds dictionary words within edit distance 2 of the input, ranked first by phonetic similarity (lk82) and then by TNC corpus frequency. It expects a single word; for multi-word text, segment first and check each word separately.

import kham

# Single misspelled word → ranked suggestions
suggs = kham.spell_suggestions("กีนข้าว", 5)
for s in suggs:
    print(f"{s.word:12} edit={s.edit_distance} soundex={s.soundex_match} freq={s.freq_score}")
# กินข้าว  edit=1  soundex=True  freq=…

# Correct word → edit_distance 0
top = kham.spell_suggestions("กิน", 1)
print(top[0].word, top[0].edit_distance)  # กิน 0

# SpellSuggestion: .word  .edit_distance  .soundex_match  .freq_score
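To make the edit_distance field concrete, here is a minimal pure-Python sketch of Levenshtein distance counted in Unicode code points. It is an illustration only: the function name `edit_distance` is hypothetical, and it is an assumption that kham measures distance this way (it may, for instance, count transpositions or operate on different units).

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(edit_distance("กีนข้าว", "กินข้าว"))  # 1
print(edit_distance("กิน", "กิน"))          # 0
```

The misspelled input differs from the suggestion by a single vowel mark, which is why the first call yields 1 under this definition.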
6

Keyword extraction

extract_keywords() returns the most distinctive words in a document, scored by TF × inverse-corpus-frequency. Stopwords and single-character tokens are excluded.

import kham

text = ("นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ "
        "ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย")

keywords = kham.extract_keywords(text, 5)
for kw in keywords:
    print(f"{kw.word:12} score={kw.score:.4f} count={kw.count}")

# Keyword: .word  .score  .count
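The scoring idea described above can be sketched in plain Python. This is not kham's implementation: the function `score_keywords`, the use of a negative logarithm for the inverse corpus frequency, the fallback frequency for unseen words, the hand-written token list, and the corpus frequencies are all assumptions made up for illustration.

```python
import math
from collections import Counter


def score_keywords(tokens, corpus_freq, stopwords, top_n, default_freq=1e-6):
    """Rank tokens by term frequency times inverse corpus frequency (sketch)."""
    # Exclude stopwords and single-character tokens, as the text describes.
    candidates = [t for t in tokens if len(t) > 1 and t not in stopwords]
    counts = Counter(candidates)
    total = len(candidates)
    scored = []
    for word, count in counts.items():
        tf = count / total
        # Assumed form: rarer corpus words get a larger weight.
        icf = -math.log(corpus_freq.get(word, default_freq))
        scored.append((word, tf * icf, count))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hand-written tokens and invented relative frequencies, for illustration only.
tokens = ["ดาวเคราะห์", "ใหม่", "ใน", "ระบบ", "ดาวเคราะห์", "น้อย"]
corpus_freq = {"ดาวเคราะห์": 0.0001, "ใหม่": 0.01, "ระบบ": 0.005, "น้อย": 0.01}

for word, score, count in score_keywords(tokens, corpus_freq, {"ใน"}, top_n=3):
    print(f"{word:12} score={score:.4f} count={count}")
```

A word that is repeated in the document but rare in the reference corpus rises to the top, which is the behavior the TF × inverse-corpus-frequency description implies.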
7

Build from source (optional)

# Requires Rust toolchain + maturin
pip install maturin
git clone https://github.com/preedep/kham
cd kham
maturin develop -m kham-python/Cargo.toml