Benchmarks
Throughput and accuracy numbers for kham v0.5.0, measured on Apple M-series (arm64).
All criterion benchmarks run with --release and LTO enabled.
Environment
| Component | Value |
|---|---|
| CPU | Apple M-series (arm64) |
| OS | macOS |
| Rust | 1.85+ stable, LTO enabled |
| Profile | release |
| Built-in dictionary | 62,102 words · 669,387 DARTS states · 5.1 MiB |
| TNC frequency table | 106,125 entries |
Run locally: cargo bench -p kham-core
The HTML report is written to target/criterion/report/index.html.
Accuracy
Word-boundary precision / recall / F1 against a CC0 gold corpus (kham-core/testdata/).
Also compared against PyThaiNLP word_tokenize(engine='newmm') on 39 benchmark sentences.
    cargo run -p kham-bench-accuracy
    cargo run -p kham-bench-accuracy -- --threshold 0.95   # CI gate
    cargo run -p kham-bench-accuracy -- --verbose          # show failing cases
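
For readers unfamiliar with the metric, word-boundary precision, recall, and F1 can be sketched as below. This is a minimal illustration, not the code in kham-bench-accuracy: the function names are hypothetical, boundaries are taken to be the byte offsets where tokens end, and the final end-of-text boundary is counted like any other.

```rust
use std::collections::HashSet;

/// Byte offsets at which a token ends, i.e. the word boundaries of a segmentation.
fn boundaries(tokens: &[&str]) -> HashSet<usize> {
    let mut offset = 0;
    tokens
        .iter()
        .map(|t| {
            offset += t.len();
            offset
        })
        .collect()
}

/// Returns (precision, recall, F1) of the predicted boundaries against the gold ones.
/// Hypothetical helper for illustration only.
fn boundary_prf(gold: &[&str], predicted: &[&str]) -> (f64, f64, f64) {
    let g = boundaries(gold);
    let p = boundaries(predicted);
    // A true positive is a boundary present in both segmentations.
    let tp = g.intersection(&p).count() as f64;
    let precision = if p.is_empty() { 0.0 } else { tp / p.len() as f64 };
    let recall = if g.is_empty() { 0.0 } else { tp / g.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

With gold tokens กิน / ข้าว / กับ / ปลา and a prediction that merges the last two into กับปลา, every predicted boundary is correct (precision 1.0) but one gold boundary is missed (recall 0.75), giving an F1 of roughly 0.857.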
Segmentation Throughput
Pure Thai input, built-in dictionary, criterion segment/by_length benchmark.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 879 ns | 42.3 MiB/s |
| medium | 182 B | 3.80 µs | 45.1 MiB/s |
| long | 546 B | 10.9 µs | 47.1 MiB/s |
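
The throughput column relates input size to elapsed time. The sketch below shows that arithmetic with a hypothetical helper, assuming 1 MiB = 1,048,576 bytes; it is not part of kham or criterion.

```rust
use std::time::Duration;

/// Bytes per mebibyte.
const MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for `bytes` of input processed in `elapsed`.
/// Hypothetical helper for illustration only.
fn throughput_mib_s(bytes: usize, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64() / MIB
}
```

For the long input, throughput_mib_s(546, Duration::from_nanos(10_900)) evaluates to about 47.8 MiB/s, close to the 47.1 MiB/s in the table. A value recomputed from a rounded median will not necessarily match the figure the benchmark harness reports.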
Mixed-script
Thai + Latin + Number in the same input — segment/mixed.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| sparse (ธนาคาร100แห่ง) | 26 B | 744 ns | 42.3 MiB/s |
| medium (multi-boundary) | 74 B | 1.73 µs | 43.5 MiB/s |
| dense (alternating script) | 29 B | 535 ns | 55.3 MiB/s |
Normalization
Unicode normalization (สระลอย reordering, วรรณยุกต์ deduplication, NFC) — normalize/thai.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 79.9 ns | 465 MiB/s |
| medium | 182 B | 199 ns | 864 MiB/s |
| long | 546 B | 507 ns | 1.0 GiB/s |
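
To make the วรรณยุกต์ (tone mark) deduplication step concrete, here is a minimal sketch that drops a tone mark when it immediately repeats the previous character. It assumes only the four tone marks U+0E48 through U+0E4B are affected and that only identical adjacent marks are collapsed; kham's actual rules may differ, and the function names are hypothetical.

```rust
/// True for the four Thai tone marks, U+0E48 through U+0E4B.
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Drops a tone mark when it immediately repeats the previous character.
/// Hypothetical helper; not kham's normalizer.
fn dedup_tone_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    for c in input.chars() {
        if is_tone_mark(c) && prev == Some(c) {
            continue; // skip the duplicated mark
        }
        out.push(c);
        prev = Some(c);
    }
    out
}
```

A single linear pass like this does no dictionary work, which fits with normalization being much faster per byte than segmentation in the tables above.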
Dictionary
Construction
| Operation | Time (median) | Notes |
|---|---|---|
| builtin_dict() — binary blob load | 78 µs | pay-once startup cost |
| Dict::from_word_list — 62k words | 980 ms | only when merging a custom dict |
| Dict::from_word_list — 8-word list | 3.72 µs | small custom dict |
| dict/file/read_and_build — disk + build | 1.01 s | kham --dict <file> startup |
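builtin_dict() is roughly 12,500× faster than Dict::from_word_list on a word list of the same size (78 µs vs. 980 ms)
because the DARTS trie is pre-compiled by build.rs at compile time;
the runtime cost is a single O(S) binary decode pass, where S is the number of DARTS states.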
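
One way an application might keep a load cost like this pay-once is to cache the result behind std::sync::OnceLock. The sketch below is an assumption about usage, not kham's API: DemoDict, load_dict, and shared_dict are hypothetical stand-ins, and the counter exists only to show that the loader runs a single time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a loaded dictionary; not kham's `Dict` type.
struct DemoDict {
    words: Vec<&'static str>,
}

/// Counts loader invocations so the pay-once behaviour can be observed.
static LOADS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical loader standing in for the expensive decode step.
fn load_dict() -> DemoDict {
    LOADS.fetch_add(1, Ordering::SeqCst);
    DemoDict { words: vec!["กิน", "ข้าว"] }
}

/// Returns the shared dictionary, running the loader on first use only.
fn shared_dict() -> &'static DemoDict {
    static DICT: OnceLock<DemoDict> = OnceLock::new();
    DICT.get_or_init(load_dict)
}
```

Every call to shared_dict() after the first returns the same reference without invoking the loader again.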
Lookup
| Operation | Time (median) | Throughput |
|---|---|---|
| contains — hit (3-character, 9-byte word กิน) | 7.1 ns | 1.18 GiB/s |
| contains — hit (6-character, 18-byte word สวัสดี) | 18.3 ns | 940 MiB/s |
| contains — miss (ASCII non-word) | 744 ps | 7.5–8.8 GiB/s |
| prefixes — short anchor (7 B) | 42.3 ns | 473 MiB/s |
| prefixes — medium anchor (60 B) | 36.7 ns | 1.52 GiB/s |
| prefixes — long anchor (97 B) | 74.5 ns | 1.24 GiB/s |
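
The two lookup operations can be illustrated with a plain byte-keyed trie. This is a simplified sketch using nested hash maps, not the double-array (DARTS) layout kham uses, and the Trie type and its methods are hypothetical. It only shows what contains and prefixes compute: exact membership, and every dictionary word that starts at the beginning of the input.

```rust
use std::collections::HashMap;

/// One trie node, keyed by the next UTF-8 byte.
#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    is_word: bool,
}

/// Hypothetical hash-map trie for illustration; not a double-array trie.
#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    /// Adds `word` to the trie.
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for &b in word.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.is_word = true;
    }

    /// Exact-match lookup; returns as soon as a byte has no transition.
    fn contains(&self, word: &str) -> bool {
        let mut node = &self.root;
        for b in word.as_bytes() {
            match node.children.get(b) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.is_word
    }

    /// All dictionary words that are prefixes of `text`, shortest first.
    fn prefixes<'a>(&self, text: &'a str) -> Vec<&'a str> {
        let mut found = Vec::new();
        let mut node = &self.root;
        for (i, b) in text.as_bytes().iter().enumerate() {
            match node.children.get(b) {
                Some(next) => node = next,
                None => break,
            }
            if node.is_word {
                // A complete word ends here, so i + 1 is a char boundary.
                found.push(&text[..=i]);
            }
        }
        found
    }
}
```

With กิน, กินข้าว, and ข้าว inserted, prefixes("กินข้าวกับปลา") yields กิน and กินข้าว. A miss on an ASCII non-word fails at the first byte, which is one plausible reason the miss case in the table is the cheapest lookup.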
SQLite FTS5 Extension
Criterion benchmarks via rusqlite with bundled SQLite (FTS5 enabled), in-memory database.
Pipeline per xTokenize call: normalize → NE tag → stopword → POS → synonym expand → RTGS romanization.
Indexing — INSERT throughput
| Benchmark | Input | Size | Time (median) | Throughput |
|---|---|---|---|---|
| index/single/short | กินข้าวกับปลา | 21 B | 15.5 µs | 2.47 MiB/s |
| index/single/medium | ~63 B Thai prose | 63 B | 41.8 µs | 4.14 MiB/s |
| index/single/long | 3× medium | 189 B | 94.3 µs | 5.46 MiB/s |
| index/single/mixed | Thai + Latin + Number | 37 B | 32.4 µs | 2.32 MiB/s |
| index/batch_100/short | 100 × short | 2.1 KB | 640 µs (6.4 µs/doc) | 6.0 MiB/s |
| index/batch_100/medium | 100 × medium | 6.3 KB | 2.54 ms (25.4 µs/doc) | 7.1 MiB/s |
| index/batch_100/long | 100 × long | 18.9 KB | 6.75 ms (67.5 µs/doc) | 7.6 MiB/s |
Query latency
Table pre-populated with 1,000 rows of medium input.
| Benchmark | Query | Result rows | Time (median) |
|---|---|---|---|
| query/single_word/thai_common | ข้าว | 1,000 | 88.3 µs |
| query/single_word/thai_rare | ปลา | 1,000 | 88.9 µs |
| query/single_word/number | 100 | 0 | 1.4 µs |
| query/single_word/latin | hello | 0 | 1.5 µs |
| query/snippet | ข้าว (top 10 snippets) | 10 | 417 µs |