kham CLI

A command-line Thai NLP tool — segment text, inspect FTS pipelines, check spelling, extract keywords, and romanize Thai to RTGS Latin, all without writing a single line of code.

Install

From crates.io

cargo install kham-cli

Pre-built binaries (no Rust toolchain required)

Download a binary for your platform from the GitHub Releases page. Platforms: macOS arm64/x86_64, Linux x86_64, Windows x86_64.

# macOS / Linux — extract and move to PATH
tar xzf kham-cli-0.8.2-aarch64-apple-darwin.tar.gz
sudo mv kham /usr/local/bin/

# Verify
kham --version

Build from source

git clone https://github.com/preedep/kham
cd kham
cargo build -p kham-cli --release
# Binary at: target/release/kham

Basic segmentation

Pass Thai text as an argument. Tokens are separated by | by default.

kham "กินข้าวกับปลา"
# กิน|ข้าว|กับ|ปลา

# Mixed-script input
kham "Hello กรุงเทพ 2024"
# Hello|กรุงเทพ|2024

# Multiple sentences
kham "กินข้าว" "เที่ยวทะเล"
# กิน|ข้าว
# เที่ยว|ทะเล

Output flags

Control what information is attached to each token.

`--kind` — token script category

kham --kind "Hello กรุงเทพ 2024"
# Hello:Latin|กรุงเทพ:Thai|2024:Number

`--spans` — Unicode char offsets

kham --spans "กินข้าว"
# กิน:0-3|ข้าว:3-7

`--kind --spans` — combined

kham --kind --spans "กินข้าว"
# กิน:Thai:0-3|ข้าว:Thai:3-7
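
The combined `token:Kind:start-end` form is compact but awkward to feed into other tools. A minimal sketch that splits it into tab-separated columns follows; `spans_to_tsv` is a hypothetical helper name, and it assumes tokens contain neither `|` nor `:` (for arbitrary input, the JSON format is the safer choice).

```shell
# Convert `kham --kind --spans` text output (token:Kind:start-end|...)
# into TSV rows: token, kind, start, end.
# Assumption: tokens themselves contain neither '|' nor ':'.
spans_to_tsv() {
  tr '|' '\n' | awk -F: '
    NF == 3 {
      split($3, span, "-")   # span[1] = start offset, span[2] = end offset
      printf "%s\t%s\t%s\t%s\n", $1, $2, span[1], span[2]
    }'
}

# Usage: kham --kind --spans "กินข้าว" | spans_to_tsv
```

Rows that do not have exactly three colon-separated parts are skipped rather than guessed at.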

`--sep` — custom separator

kham --sep " " "กินข้าวกับปลา"
# กิน ข้าว กับ ปลา

kham --sep $'\n' "กินข้าวกับปลา"
# กิน
# ข้าว
# กับ
# ปลา

`--whitespace` — include whitespace tokens

# By default whitespace tokens are hidden
kham "กิน ข้าว"
# กิน|ข้าว

# Include them explicitly
kham --whitespace "กิน ข้าว"
# กิน| |ข้าว

`--normalize` — Unicode normalization before segmenting

Reorders floating vowels (สระลอย), deduplicates tone marks, and composes sara am (อำ) before segmenting. Use this when your input may come from multiple keyboards or encodings.

kham --normalize "กินข้าวกับปลา"
# กิน|ข้าว|กับ|ปลา

FTS mode (`--fts`)

Switches to the FtsTokenizer pipeline used by the PostgreSQL and SQLite extensions. Each token is printed on its own line with tab-separated metadata fields — ideal for inspecting what the search index will contain.

Basic FTS output

kham --fts "กินข้าวกับปลา"
# text    kind  pos         ne    stop   syn
# กิน     Thai  verb        -     false  กิน
# ข้าว    Thai  noun        -     false  ข้าว
# กับ     Thai  preposition -     true   (empty — stopword suppressed)
# ปลา     Thai  noun        -     false  ปลา

Fields: text · kind=KIND · pos=POS · ne=NE · stop=BOOL · syn=SYNONYMS
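
To see only what would actually reach the index, the stopword rows can be filtered out. The sketch below assumes each line carries bare tab-separated values in the order text, kind, pos, ne, stop, syn, as in the example above; if your build prints `key=value` pairs instead, the comparison needs adjusting. `fts_indexable` is a hypothetical helper name.

```shell
# Print the text of every non-stopword token from `kham --fts` output.
# Assumption: tab-separated columns in the order
#   text, kind, pos, ne, stop, syn
# so column 5 holds the stop flag ("true" or "false").
fts_indexable() {
  awk -F'\t' '$5 == "false" { print $1 }'
}

# Usage: kham --fts "กินข้าวกับปลา" | fts_indexable
```

A header row, if one is printed, is dropped as well because its fifth column is not `false`.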

`--soundex` — add phonetic code to `syn=`

Valid algorithms: lk82 (default), udom83, metasound. Only applies to Thai and Named tokens.

kham --fts --soundex lk82 "กินข้าวกับปลา"
# กิน    Thai  verb  -  false  กิน	1500
# ข้าว   Thai  noun  -  false  ข้าว	5800
# กับ    Thai  prep  -  true   (stopword)
# ปลา    Thai  noun  -  false  ปลา	4800

Named entity inspection

kham --fts "กรุงเทพมหานครเป็นเมืองหลวงของไทย"
# กรุงเทพมหานคร  Thai  noun  place  false  กรุงเทพมหานคร
# เป็น           Thai  verb  -      false  เป็น
# เมืองหลวง     Thai  noun  -      false  เมืองหลวง
# ของ            Thai  prep  -      true   (stopword)
# ไทย            Thai  noun  place  false  ไทย

Confidence scores

Every token carries a confidence score: 0.0 for unknown tokens, 1.0 for unambiguous dictionary matches, and intermediate values that reflect TNC corpus frequency and boundary ambiguity.

`--confidence` — show score per token

kham --confidence "กินข้าวกับปลา"
# กิน:conf=0.92|ข้าว:conf=0.98|กับ:conf=1.00|ปลา:conf=0.99

kham --kind --confidence "กินข้าวกับปลา"
# กิน:Thai:conf=0.92|ข้าว:Thai:conf=0.98|กับ:Thai:conf=1.00|ปลา:Thai:conf=0.99
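
When reviewing noisy input it can help to list the tokens that fall below a threshold, rather than only keeping the ones above it. A rough sketch, assuming the `conf=<val>` suffix shown above; `low_confidence` is a hypothetical helper, not part of kham.

```shell
# List tokens whose confidence is below a threshold (default 0.9).
# Reads `kham --confidence` or `kham --kind --confidence` text output.
# Assumption: each token ends in "conf=<number>" and tokens contain no '|'.
low_confidence() {
  threshold=${1:-0.9}
  tr '|' '\n' | awk -F'conf=' -v t="$threshold" '
    NF == 2 && ($2 + 0) < (t + 0) {
      token = $1
      sub(/:.*/, "", token)        # drop ":Kind:" and the trailing ":"
      printf "%s\t%s\n", token, $2
    }'
}

# Usage: kham --kind --confidence "กินข้าวกับปลา" | low_confidence 0.95
```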

`--min-confidence` — filter low-confidence tokens

Useful for extracting only high-quality tokens from noisy input. Works in both basic and FTS mode.

# Only tokens with confidence ≥ 0.9
kham --min-confidence 0.9 "ข้อความที่มีคำผิดและคำถูก"
# ข้อความ|ที่มี|คำ|ถูก

# Combined with FTS mode
kham --fts --min-confidence 0.85 "กินข้าวกับปลา"

Spell checking (`--spell`)

Returns ranked spelling suggestions for the input word. Candidates have edit distance ≤ 2 and are sorted by: phonetic similarity (lk82 soundex match), edit distance, then TNC corpus frequency.

Basic spell check

kham --spell "กีนข้าว"
# Suggestions for "กีนข้าว":
#   1. กินข้าว   edit=1  soundex=true
#   2. กินข้าวๆ  edit=2  soundex=true
#   3. กินข้าวร้อน  edit=3  soundex=false
# …

# A correctly spelled word returns itself at edit distance 0
kham --spell "กิน"
# Suggestions for "กิน":
#   1. กิน  edit=0  soundex=true

`--top-n` — limit results

kham --spell --top-n 3 "กีนข้าว"
# Suggestions for "กีนข้าว":
#   1. กินข้าว   edit=1  soundex=true
#   2. กินข้าวๆ  edit=2  soundex=true
#   3. กินข้าวร้อน  edit=3  soundex=false

Spell-check a word list via stdin

# Each line is treated as an independent word
printf 'กีนข้าว\nกรุงเทฑ\nปลา\n' | kham --spell --top-n 1
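
For batch correction, the suggestion blocks can be reduced to one `input<TAB>best suggestion` pair per word. This is a sketch under the assumption that stdin mode prints the same `Suggestions for "…":` header and numbered lines as the single-word examples above; `top_suggestions` is a hypothetical helper name.

```shell
# Reduce `kham --spell` output to "input<TAB>top suggestion" pairs.
# Assumption: every word produces a header line
#   Suggestions for "<word>":
# followed by numbered lines such as
#   1. <suggestion>  edit=N  soundex=BOOL
top_suggestions() {
  awk '
    /^Suggestions for / {
      word = $3
      gsub(/[":]/, "", word)       # strip the quotes and trailing colon
    }
    $1 == "1." { printf "%s\t%s\n", word, $2 }'
}

# Usage: kham --spell --top-n 1 < words.txt | top_suggestions
```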

Keyword extraction (`--keywords`)

Extracts the most distinctive words and phrases from the text, scored by TF × inverse-corpus-frequency. Stopwords and single-character tokens are excluded. Output has two sections: unigram keywords and n-gram keyphrases (bigrams / trigrams).

Extract keywords from text

kham --keywords "นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย"
# Keywords:
#   1. ดาวเคราะห์   score=0.4821  count=2
#   2. ระบบสุริยะ   score=0.3210  count=1
#   3. นักวิทยาศาสตร์  score=0.2980  count=1
#
# Keyphrases:
#   1. ดาวเคราะห์ใหม่    score=0.3100
#   2. ระบบสุริยะ ดาว    score=0.1900
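
The `count` column is plain term frequency, which can be reproduced from ordinary segmentation output as a sanity check. The sketch below covers only that TF half; it does not attempt the corpus-frequency weighting or stopword filtering, and `term_counts` is a hypothetical helper name.

```shell
# Count how often each token occurs, most frequent first.
# Reads one token per line, e.g. from `kham --sep $'\n'`.
# Blank and whitespace-only lines are ignored.
term_counts() {
  awk 'NF { count[$0]++ }
       END { for (t in count) printf "%d\t%s\n", count[t], t }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage: kham --sep $'\n' "..." | term_counts
```

Ties are broken by byte order so the output is stable across runs.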

`--top-n` — limit results per section

kham --keywords --top-n 5 "ข้อความยาว…"

Process documents via stdin

# Each line is treated as an independent document
cat article.txt | kham --keywords --top-n 10

RTGS romanization (`--romanize`)

Segments the text and transliterates every Thai/Named token to RTGS Latin. Non-Thai tokens (numbers, Latin, punctuation, whitespace) pass through unchanged.

kham --romanize "กินข้าวกับปลา"
# kin khao kap pla

kham --romanize "กรุงเทพมหานครเป็นเมืองหลวงของไทย"
# krung thep maha nakhon pen mueang luang khong thai

# Mixed-script: non-Thai tokens pass through
kham --romanize "Hello กรุงเทพ 2024"
# Hello krung thep 2024
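
One common use for romanized output is building ASCII-friendly slugs or file names. A small sketch using only standard tools; `slugify` is a hypothetical helper, and it assumes the space-separated output shown above.

```shell
# Turn romanized text into a lowercase, hyphen-separated slug.
# Each input line yields one slug; runs of spaces collapse to one hyphen.
slugify() {
  tr '[:upper:]' '[:lower:]' | tr -s ' ' '-'
}

# Usage: kham --romanize "Hello กรุงเทพ 2024" | slugify
```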

Output formats (`--format`)

All modes support three output formats: text (default), json, and csv.

JSON output

kham --format json "กินข้าวกับปลา"
# [{"text":"กิน","kind":"Thai","char_start":0,"char_end":3,"confidence":0.92},
#  {"text":"ข้าว","kind":"Thai","char_start":3,"char_end":7,"confidence":0.98},
#  {"text":"กับ","kind":"Thai","char_start":7,"char_end":10,"confidence":1.0},
#  {"text":"ปลา","kind":"Thai","char_start":10,"char_end":13,"confidence":0.99}]

# Pretty-print with jq
kham --format json "กินข้าว" | jq .

CSV output

kham --format csv "กินข้าวกับปลา"
# text,kind,char_start,char_end,confidence
# กิน,Thai,0,3,0.92
# ข้าว,Thai,3,7,0.98
# กับ,Thai,7,10,1.0
# ปลา,Thai,10,13,0.99
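
Because the CSV has a fixed header, simple aggregates are easy to compute with awk. The sketch below averages the confidence column; `mean_confidence` is a hypothetical helper, and it reads the last field of each row so that a token containing a comma does not shift the column.

```shell
# Print the mean confidence of all tokens in `kham --format csv` output.
# Skips the header row; prints nothing if there are no data rows.
mean_confidence() {
  awk -F, 'NR > 1 && NF >= 5 { sum += $NF; n++ }
           END { if (n) printf "%.2f\n", sum / n }'
}

# Usage: kham --format csv "กินข้าวกับปลา" | mean_confidence
```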

JSON in FTS mode

kham --fts --format json "กินข้าว"
# [{"text":"กิน","kind":"Thai","pos":"verb","ne":null,"stop":false,"syn":["กิน"],"confidence":0.92},
#  {"text":"ข้าว","kind":"Thai","pos":"noun","ne":null,"stop":false,"syn":["ข้าว"],"confidence":0.98}]

Custom dictionary (`--dict`)

Merge domain-specific words into the built-in dictionary. The file should be plain text with one word per line. The custom words are overlaid on top of the built-in 62k-word dictionary — no trie rebuild required.

# custom_words.txt — one word per line
# (e.g., brand names, technical terms, acronyms)

kham --dict custom_words.txt "ข้อความที่มีคำเฉพาะทาง"

# Combine with other flags
kham --dict domain.txt --kind --confidence "ข้อความ"

Note: --dict is incompatible with --fts. When both are given, --dict is ignored with a warning.
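
Word lists exported from spreadsheets or other systems often carry Windows line endings, stray spaces, blank lines, and duplicates. How kham treats such lines is not stated here, so cleaning the file first is a cheap precaution. `clean_wordlist` below is a hypothetical helper, not part of kham.

```shell
# Normalize a one-word-per-line list: strip carriage returns and
# surrounding whitespace, drop blank lines, and remove duplicates
# while keeping the first occurrence in its original position.
clean_wordlist() {
  tr -d '\r' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    awk 'NF && !seen[$0]++'
}

# Usage: clean_wordlist < raw_words.txt > custom_words.txt
```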

Stdin and pipe usage

When no argument is given, kham reads from stdin — one input per line, one output per line.

# Pipe from echo
echo "กินข้าวกับปลา" | kham

# Process a file line by line
cat sentences.txt | kham --kind --format json

# Spell-check a word list
cat words.txt | kham --spell --top-n 1

# Extract keywords from all lines in a document
cat article.txt | kham --keywords --top-n 5

# Use with grep: print one token per line, then keep only tokens that contain ปลา
cat corpus.txt | kham --sep $'\n' | grep 'ปลา'

Flag reference

Flag	Default	Description
--dict <FILE>	—	Custom word list (plain text, one word per line). Incompatible with --fts.
--sep <STR>	|	Token separator in text output mode.
--whitespace	off	Include whitespace tokens in output.
--normalize	off	Normalize text before segmenting (สระลอย, tone dedup, NFC).
--kind	off	Append token kind: กิน:Thai.
--spans	off	Append Unicode char span: กิน:0-3.
--fts	off	Switch to FtsTokenizer; one token per line with tab-separated fields.
--soundex <ALGO>	lk82	Phonetic code in FTS mode: lk82, udom83, metasound. Requires --fts.
--confidence	off	Append conf=<val> per token in text mode.
--min-confidence <MIN>	0.0	Filter to tokens with confidence ≥ MIN (0.0–1.0).
--format <FMT>	text	Output format: text, json, csv.
--romanize	off	Segment and romanize Thai to RTGS Latin. Incompatible with --fts.
--spell	off	Spell-check mode: ranked suggestions per input word.
--keywords	off	Keyword mode: TF×IDF keywords and keyphrases.
--top-n <N>	10	Max results in --spell or --keywords mode.

Mode priority: --spell > --keywords > --fts > basic (includes --romanize). Conflicting mode flags emit a warning.
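
The priority rule can be read as a simple first-match check. The function below is only an illustration of the documented order, written for wrapper scripts that want to predict which mode a flag set will select; it is not kham's own implementation, and `resolve_mode` is a hypothetical name.

```shell
# Report which mode wins for a given set of flags, following the
# documented order: --spell > --keywords > --fts > basic.
# Flags that do not select a mode (e.g. --romanize, --kind) are ignored.
resolve_mode() {
  spell=0 keywords=0 fts=0
  for flag in "$@"; do
    case $flag in
      --spell)    spell=1 ;;
      --keywords) keywords=1 ;;
      --fts)      fts=1 ;;
    esac
  done
  if   [ "$spell" -eq 1 ];    then echo spell
  elif [ "$keywords" -eq 1 ]; then echo keywords
  elif [ "$fts" -eq 1 ];      then echo fts
  else                             echo basic
  fi
}

# Usage: resolve_mode --fts --spell
```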
