
Python

The kham Python bindings are built with PyO3 and expose the full word segmentation API, including Token objects that carry kind, POS, NE, and span information.

Installation

pip install kham

uv add kham

poetry add kham

Basic segmentation

import kham

words = kham.segment("กินข้าวกับปลา")
print(words)
# ['กิน', 'ข้าว', 'กับ', 'ปลา']

mixed = kham.segment("Hello กรุงเทพ 2024!")
print(mixed)

Rich Token objects

import kham

tokens = kham.segment_tokens("กินข้าวกับปลา")
for tok in tokens:
    print(f"{tok.text!r:10}  kind={tok.kind:12}  chars={tok.char_start}..{tok.char_end}")

Token fields

tok.text        # str:   the token text
tok.kind        # str:   "Thai" | "Latin" | "Number" | "Punctuation" | ...
tok.char_start  # int:   start offset in Unicode characters (usable for str slicing)
tok.char_end    # int:   end offset in Unicode characters
tok.byte_start  # int:   start offset in UTF-8 bytes
tok.byte_end    # int:   end offset in UTF-8 bytes
tok.confidence  # float: 0.0 (Unknown) … 1.0 (high-confidence dictionary match)
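
The character and byte offsets differ because each Thai character occupies three bytes in UTF-8. The sketch below uses only the Python standard library, not kham; the helper name `char_to_byte_offset` and the hand-picked offsets are assumptions made for illustration. It shows how a character offset maps to a byte offset within the same string.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Return the UTF-8 byte offset that corresponds to a character offset."""
    return len(text[:char_offset].encode("utf-8"))


text = "Hello กรุงเทพ"
start, end = 6, 13  # character offsets of "กรุงเทพ", chosen by hand

# Character offsets slice the str directly.
print(text[start:end])

# Byte offsets slice the UTF-8 encoded form of the same text.
byte_start = char_to_byte_offset(text, start)
byte_end = char_to_byte_offset(text, end)
print(byte_start, byte_end)  # 6 27
print(text.encode("utf-8")[byte_start:byte_end].decode("utf-8"))
```

Character offsets are the natural choice when working with Python `str` values, while byte offsets are useful when the same text is handled as UTF-8 bytes, for example on the Rust side.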

Spell checking

spell_suggestions() searches the dictionary for words within an edit distance of 2 or less, ranked by lk82 phonetic matching and TNC frequency. It accepts a single word; if the input contains several words, segment it first.

import kham

# spell_suggestions(word, max_n) → list[SpellSuggestion]
suggs = kham.spell_suggestions("กีนข้าว", 5)
for s in suggs:
    print(f"{s.word:12} edit={s.edit_distance} soundex={s.soundex_match} freq={s.freq_score}")
# กินข้าว  edit=1  soundex=True  freq=…

# SpellSuggestion: .word  .edit_distance  .soundex_match  .freq_score
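
To make the edit distance threshold concrete, here is a minimal sketch of Levenshtein distance in plain Python. It is not kham's implementation, and it assumes the distance is counted over Unicode code points, which the source does not state; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over Unicode code points."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


# One vowel mark differs between the misspelling and the correction.
print(edit_distance("กีนข้าว", "กินข้าว"))  # 1
```

Under this assumption, a candidate qualifies when `edit_distance(word, candidate) <= 2`.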

Keyword extraction

extract_keywords() returns the most salient words in a document, scored by TF × inverse corpus frequency, with stopwords removed automatically.

import kham

text = ("นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ "
        "ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย")

keywords = kham.extract_keywords(text, 5)
for kw in keywords:
    print(f"{kw.word:12} score={kw.score:.4f} count={kw.count}")

# Keyword: .word  .score  .count
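
The scoring idea can be sketched in plain Python. This is an illustration only: the exact formula kham uses is not given here, so the logarithmic weighting, the function name `score_keywords`, and the frequency values below are all assumptions.

```python
import math
from collections import Counter


def score_keywords(tokens, corpus_freq, stopwords, top_n):
    """Rank tokens by term frequency times an inverse-corpus-frequency weight."""
    counts = Counter(t for t in tokens if t not in stopwords)
    total = sum(counts.values())
    scored = []
    for word, count in counts.items():
        tf = count / total
        # Assumed weighting: words that are rare in the reference corpus score higher.
        icf = math.log(1.0 + 1.0 / corpus_freq.get(word, 1e-6))
        scored.append((word, tf * icf, count))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Pre-segmented tokens and made-up relative corpus frequencies.
tokens = ["ดาวเคราะห์", "ใหม่", "ใน", "ระบบ", "ดาวเคราะห์"]
corpus_freq = {"ดาวเคราะห์": 0.0001, "ใหม่": 0.01, "ระบบ": 0.005}
stopwords = {"ใน"}

for word, score, count in score_keywords(tokens, corpus_freq, stopwords, 2):
    print(f"{word:12} score={score:.4f} count={count}")
```

A word that appears often in the document but rarely in the reference corpus ranks highest, which is why the repeated "ดาวเคราะห์" comes first in this toy input.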
