kham is an open-source Thai NLP library written in Rust. It provides word segmentation at 33 MiB/s, POS tagging, NER, spell correction, phonetic encoding, RTGS romanization, and keyword extraction. It is available for Rust, Python, WebAssembly, C (via FFI), PostgreSQL, and SQLite.

How does kham compare to PyThaiNLP?

kham is approximately 15× faster than PyThaiNLP newmm (33–34 MiB/s versus about 2 MiB/s), has zero external dependencies, and supports WebAssembly, PostgreSQL FTS, and SQLite FTS5, none of which PyThaiNLP supports. On kham's 228-case test set, kham scores F1 1.000 and PyThaiNLP about 0.94.

Can kham run offline or on mobile?

Yes. kham-core is a no_std Rust library with no network calls and no runtime dependencies. The WASM build is ~300 KB. The SQLite extension runs on Android and iOS for offline-first mobile search.

How do I install kham for Python?

Run: pip install kham. Then: import kham; words = kham.segment("กินข้าวกับปลา"). No additional setup is required.

v0.8.2 · Release notes →

The Fastest Thai NLP Library for Production

kham is an open-source Thai NLP engine written in Rust — zero external dependencies, no_std core, and a complete pipeline from raw text to structured tokens.

33 MiB/s

Segmentation throughput

F1 1.000

Accuracy on 228 test cases

62k words

Built-in dictionary

6 targets

Rust · Python · WASM · C · PG · SQLite

Complete Thai NLP pipeline

Every module is available in Rust, Python, WebAssembly, and C FFI — one library, all platforms.

✂️

Word Segmentation

Maximal matching over a 62,102-word DAWG dictionary. 33–34 MiB/s on Apple M-series. F1 1.000 on 228 curated test cases.
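
To make the idea concrete, here is a minimal sketch of maximal matching: dynamic programming that picks the split with the fewest dictionary words. It is not kham's implementation. It assumes a plain Python set in place of the DAWG, a hypothetical four-word dictionary, and a made-up `unknown_cost` penalty for characters no dictionary word covers.

```python
def segment(text, dictionary, unknown_cost=2):
    """Split text into the fewest dictionary words (maximal matching).

    Characters not covered by any dictionary word become single-character
    tokens, each charged unknown_cost so that known words are preferred.
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=1)
    # best[i] = (lowest cost to segment text[:i], start index of last token)
    best = [(float("inf"), -1)] * (n + 1)
    best[0] = (0, -1)
    for i in range(n):
        cost = best[i][0]
        # Fallback: emit one unknown character.
        if cost + unknown_cost < best[i + 1][0]:
            best[i + 1] = (cost + unknown_cost, i)
        # Try every dictionary word that starts at position i.
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in dictionary and cost + 1 < best[j][0]:
                best[j] = (cost + 1, i)
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]


toy_dict = {"กิน", "ข้าว", "กับ", "ปลา"}  # hypothetical 4-word dictionary
print(segment("กินข้าวกับปลา", toy_dict))
```

Unlike greedy longest-match, the fewest-words criterion can back off from a long first word: with the toy dictionary {"ab", "abc", "cd"}, the input "abcd" splits as "ab" + "cd" rather than "abc" + "d".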

🏷️

POS Tagging

13-category part-of-speech tagger derived from the ORCHID tagset. Integrated into the FTS pipeline for lexeme-level filtering.

🔍

Named Entity Recognition

Person, Place, and Organization tags via a built-in gazetteer. NE tokens emit colocated lexemes in FTS for entity-aware search.

🔤

Spell Correction

Levenshtein ≤ 2 candidates re-ranked by lk82 phonetic similarity and TNC corpus frequency. Single-word and full-text modes.
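
As a rough illustration of the candidate step, the sketch below collects dictionary words within edit distance 2 and orders them by distance, then by frequency. It is an assumption-laden toy, not kham's code: the lk82 phonetic re-ranking is left out, the frequency table is invented rather than taken from TNC, and ASCII words are used only for readability.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def suggest(word, frequencies, max_distance=2):
    """Return candidates within max_distance, closest and most frequent first."""
    scored = []
    for candidate, freq in frequencies.items():
        d = levenshtein(word, candidate)
        if d <= max_distance:
            scored.append((d, -freq, candidate))
    return [candidate for _, _, candidate in sorted(scored)]


# Hypothetical frequency table standing in for corpus counts.
toy_freq = {"hello": 50, "help": 30, "held": 5, "world": 10}
print(suggest("helo", toy_freq))
```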

🔊

Phonetic Encoding

Three Thai soundex systems: lk82 (general), udom83 (finer sibilants), MetaSound (per-syllable). Cross-language Thai–English phonetics.

🌐

RTGS Romanization

Royal Thai General System of Transcription. Sentence-level romanization: Thai/Named tokens transliterated, others pass through unchanged.

🔑

Keyword Extraction

TF × inverse-corpus-frequency scoring, stopwords excluded. Unigram and n-gram (bigram/trigram) keyphrase extraction.
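
One way to read "TF × inverse-corpus-frequency" is sketched below for unigrams. The exact weighting kham uses is not stated here, so the logarithmic form with add-one smoothing is an assumption, and the corpus counts and stopword list are invented for the example.

```python
import math
from collections import Counter


def extract_keywords(tokens, corpus_freq, corpus_total, stopwords, top_k=3):
    """Rank tokens by term frequency times inverse corpus frequency."""
    counts = Counter(t for t in tokens if t not in stopwords)
    scores = {}
    for token, tf in counts.items():
        # Words that are rare in the corpus get a large weight;
        # the +1 keeps unseen words from dividing by zero.
        icf = math.log(corpus_total / (corpus_freq.get(token, 0) + 1))
        scores[token] = tf * icf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical corpus statistics: counts out of 1000 corpus tokens.
toy_corpus = {"the": 900, "cat": 9, "sat": 99}
doc = ["the", "cat", "sat", "the", "cat", "dawg"]
print(extract_keywords(doc, toy_corpus, 1000, stopwords={"the"}))
```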

📊

Number Normalization

Thai digit strings to ASCII equivalents, Thai word-to-number conversion, and baht text conversion for financial documents.
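
The digit part of this is simple enough to show in a few lines. The sketch below maps the Thai digits U+0E50 to U+0E59 onto ASCII 0 to 9 with a translation table; the function name is hypothetical and is not kham's API, and word-to-number and baht text conversion are not covered.

```python
# Translation table: Thai digit code points U+0E50..U+0E59 -> "0".."9".
THAI_TO_ASCII = {0x0E50 + d: str(d) for d in range(10)}


def thai_digits_to_ascii(text):
    """Replace Thai digits with ASCII digits; leave everything else as is."""
    return text.translate(THAI_TO_ASCII)


print(thai_digits_to_ascii("ราคา ๔๕๐ บาท"))
```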

kham vs PyThaiNLP

Feature comparison for developers choosing a Thai NLP library.

Feature	kham	PyThaiNLP
Segmentation speed	33–34 MiB/s	~2 MiB/s (newmm)
F1 accuracy	1.000 (228 test cases)	≈ 0.94 (on kham test set)
Zero dependencies	Yes — no_std core	No — requires Python + torchnlp etc.
WebAssembly	Yes — 300 KB binary	No
PostgreSQL FTS	Native parser extension	No
SQLite FTS5	Loadable extension	No
Spell correction	Yes — Levenshtein + phonetic	Yes — edit distance only
Phonetic encoding	lk82, udom83, MetaSound	lk82 only
Offline / embedded	Yes — no network, no server	Partial
License	MIT OR Apache-2.0	Apache-2.0

PyThaiNLP figures come from public documentation and benchmarks. To run your own comparison: python scripts/compare_pythainlp.py
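
For readers unfamiliar with the F1 row, the sketch below shows one common way to score a segmenter: a predicted word counts as correct only if its exact character span also appears in the reference. Whether kham's test harness or the comparison script computes F1 this way is an assumption; the function names are illustrative.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def segmentation_f1(predicted, gold):
    """Word-level F1: harmonic mean of span precision and span recall."""
    pred, ref = spans(predicted), spans(gold)
    correct = len(pred & ref)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)


print(segmentation_f1(["ab", "c", "d"], ["ab", "cd"]))
```

Under this definition a score of 1.000 means every predicted span matches the reference and none is missing.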

Get started in 60 seconds

Pick your platform — the API is consistent across all targets.

Cargo.toml

[dependencies]
kham-core = "0.8"

pip

pip install kham

npm

npm install kham-wasm

Docker

docker run --rm -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 nickmsft/kham-pg:latest

Full getting-started guide →

One library, every platform

kham-core compiles to native Rust, WASM, Python wheels, a C header, and two database extensions — from a single codebase.

🦀

Rust

kham-core — no_std, zero-copy, embedded-ready

🐍

Python

pip install kham — PyO3 bindings, full API

🌐

WebAssembly

npm install kham-wasm — browser + Node.js

⚙️

C / FFI

kham-capi — cbindgen header, any language

🐘

PostgreSQL

kham-pg — native FTS parser, PG 14–18

🗃️

SQLite

kham-sqlite — FTS5 tokenizer, offline-ready

Try Thai NLP in your browser

Open full playground →

Ready to add Thai NLP to your project?

Free, open-source, MIT OR Apache-2.0. No API keys, no rate limits, no server required.
