Benchmarks
Throughput and accuracy numbers for kham v0.8.2, measured on Apple M-series (arm64).
All Criterion benchmarks are built with the release profile and LTO enabled.
Environment
| Component | Details |
|---|---|
| CPU | Apple M-series (arm64) |
| OS | macOS |
| Rust | 1.85+ stable, LTO enabled |
| Profile | release |
| Built-in dictionary | 62,102 words · 670,460 DARTS states · 5.1 MiB |
| TNC frequency table | 106,125 entries |
Run locally: cargo bench -p kham-core
The HTML report is written to target/criterion/report/index.html.
Accuracy
Word-boundary precision / recall / F1 against a CC0 gold corpus (kham-core/testdata/).
Also compared against PyThaiNLP word_tokenize(engine='newmm') on 39 benchmark sentences.
cargo run -p kham-bench-accuracy
cargo run -p kham-bench-accuracy -- --threshold 0.95   # CI gate
cargo run -p kham-bench-accuracy -- --verbose          # show failing cases
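As an illustration of the metric, here is a minimal sketch of one common way to score word-boundary precision, recall, and F1. It assumes a boundary is the byte offset at which a token ends and that both segmentations cover the same text. The function names are hypothetical, and kham-bench-accuracy may count boundaries differently.

```rust
use std::collections::HashSet;

/// Byte offsets at which each token ends, i.e. the word boundaries
/// implied by a segmentation of the text.
fn boundaries(tokens: &[&str]) -> HashSet<usize> {
    let mut offset = 0;
    tokens
        .iter()
        .map(|t| {
            offset += t.len();
            offset
        })
        .collect()
}

/// Returns (precision, recall, f1) of the predicted boundaries
/// measured against the gold boundaries.
fn boundary_prf(gold: &[&str], predicted: &[&str]) -> (f64, f64, f64) {
    let g = boundaries(gold);
    let p = boundaries(predicted);
    let hits = g.intersection(&p).count() as f64;

    let precision = if p.is_empty() { 0.0 } else { hits / p.len() as f64 };
    let recall = if g.is_empty() { 0.0 } else { hits / g.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

For example, scoring the prediction กิน / ข้าว / กับปลา against the gold segmentation กิน / ข้าว / กับ / ปลา finds three of the four gold boundaries with no spurious ones, so precision is 1.0 and recall is 0.75. Note that the end-of-text boundary always matches under this definition, which slightly inflates scores on short inputs.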
Segmentation Throughput
Pure Thai input, built-in dictionary, criterion segment/by_length benchmark.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 39 B | 1.12 µs | 33.2 MiB/s |
| medium | 180 B | 5.09 µs | 33.7 MiB/s |
| long | 540 B | 14.95 µs | 34.4 MiB/s |
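The Throughput column can be reproduced from the Size and Time columns. The sketch below assumes 1 MiB = 2^20 bytes and uses a hypothetical helper name; it is a way to check the arithmetic, not code from the benchmark suite.

```rust
/// Throughput in MiB/s for `bytes` of input processed in `nanos` nanoseconds.
/// Assumes 1 MiB = 1024 * 1024 bytes.
fn mib_per_sec(bytes: usize, nanos: f64) -> f64 {
    let seconds = nanos * 1e-9;
    bytes as f64 / seconds / (1024.0 * 1024.0)
}
```

Calling mib_per_sec(39, 1120.0) for the short row gives roughly 33.2, matching the table.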
Mixed-script
Thai + Latin + Number in the same input — segment/mixed.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| sparse (ธนาคาร100แห่ง) | 33 B | 948 ns | 33.2 MiB/s |
| medium (multi-boundary) | 79 B | 2.23 µs | 33.8 MiB/s |
| dense (alternating script) | 31 B | 770 ns | 38.4 MiB/s |
Normalization
Unicode normalization (สระลอย reordering, วรรณยุกต์ deduplication, NFC) — normalize/thai.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 39 B | 107.6 ns | 345 MiB/s |
| medium | 180 B | 285.9 ns | 600 MiB/s |
| long | 540 B | 743.7 ns | 692 MiB/s |
Dictionary
Construction
| Operation | Time (median) | Notes |
|---|---|---|
| builtin_dict() — binary blob load | 92.9 µs | pay-once startup cost |
| Tokenizer::new() — full startup | 33.4 ms | dict + freq + NE + POS tables |
| FtsTokenizer::new() | 46.7 ms | adds synonym + RTGS + soundex tables |
| Dict::from_word_list — 62k words | 1.63 s | only when merging a custom dict |
| Dict::from_word_list — 8-word list | 5.0 µs | small custom dict |
builtin_dict() is ~17,500× faster than building the same 62k-word dictionary with Dict::from_word_list (92.9 µs vs 1.63 s), because the DARTS trie is pre-compiled by build.rs at compile time; the runtime cost is a single O(S) binary decode pass.
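Because full startup is a pay-once cost in the tens of milliseconds, an application would typically build the tokenizer once and share it. The following is a minimal sketch of that pattern using std::sync::OnceLock. ExpensiveTokenizer is a hypothetical stand-in, not kham's real type, and whether kham's Tokenizer can be stored in a static (that is, whether it is Send + Sync) is an assumption to verify.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the expensive constructor actually ran.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for an expensive-to-build tokenizer
/// (not kham's real type).
struct ExpensiveTokenizer {
    words: Vec<&'static str>,
}

impl ExpensiveTokenizer {
    fn new() -> Self {
        BUILDS.fetch_add(1, Ordering::SeqCst);
        // A real constructor would load dictionary and frequency tables here.
        ExpensiveTokenizer { words: vec!["กิน", "ข้าว"] }
    }
}

/// Process-wide instance: built on first call, reused on every later call.
fn shared_tokenizer() -> &'static ExpensiveTokenizer {
    static INSTANCE: OnceLock<ExpensiveTokenizer> = OnceLock::new();
    INSTANCE.get_or_init(ExpensiveTokenizer::new)
}
```

Every caller of shared_tokenizer() receives the same reference, so the construction cost is paid only by the first call, even when several threads race to initialize it.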
Lookup
| Operation | Time (median) | Throughput |
|---|---|---|
| contains — hit (9-byte word กิน) | 11.1 ns | 770 MiB/s |
| contains — hit (18-byte word สวัสดี) | 29.2 ns | 587 MiB/s |
| contains — miss (ASCII non-word) | 1.22 ns | ~4.5 GiB/s |
| prefixes — short anchor (21 B) | 63.8 ns | 314 MiB/s |
| prefixes — medium anchor (60 B) | 55.9 ns | 1.0 GiB/s |
| prefixes — long anchor (99 B) | 100.0 ns | 944 MiB/s |
SQLite FTS5 Extension
Criterion benchmarks via rusqlite with bundled SQLite (FTS5 enabled), in-memory database.
Pipeline per xTokenize call: normalize → NE tag → stopword → POS → synonym expand → RTGS romanization.
Indexing — INSERT throughput
| Benchmark | Input | Size | Time (median) | Throughput |
|---|---|---|---|---|
| index/single/short | กินข้าวกับปลา | 39 B | 20.2 µs | 1.84 MiB/s |
| index/single/medium | Thai prose ~60 chars | 180 B | 64.8 µs | 2.65 MiB/s |
| index/single/long | 3× medium | 540 B | 122.3 µs | 4.21 MiB/s |
| index/single/mixed | Thai + Latin + Number | 79 B | 41.8 µs | 1.80 MiB/s |
| index/batch_100/short | 100 × short | 3.9 KB | 813 µs (8.1 µs/doc) | 4.57 MiB/s |
| index/batch_100/medium | 100 × medium | 18.0 KB | 3.67 ms (36.7 µs/doc) | 4.68 MiB/s |
| index/batch_100/long | 100 × long | 54.0 KB | 8.84 ms (88.4 µs/doc) | 5.82 MiB/s |
Query latency
Table pre-populated with 1,000 rows of medium input.
| Benchmark | Query | Result rows | Time (median) |
|---|---|---|---|
| query/single_word/thai_common | ข้าว | 1,000 | 4.28 µs |
| query/single_word/thai_rare | ปลา | 1,000 | 139.7 µs |
| query/single_word/number | 100 | 0 | 2.00 µs |
| query/single_word/latin | hello | 0 | 2.10 µs |
| query/snippet | ข้าว (LIMIT 10 snippets) | 10 | 4.24 µs |
PostgreSQL FTS Extension
Measured inside Docker (PostgreSQL 17, Linux ARM64) via make -C kham-pg bench.
Pipeline per to_tsvector call:
normalize → segment → NE tag → lk82 soundex → RTGS romanization.
to_tsvector & plainto_tsquery throughput
Batch throughput via generate_series with 3 input variants per size, cycling to defeat PG's function-result cache.
| Operation | Doc size | ops/s | µs/op |
|---|---|---|---|
| to_tsvector small | ~63 B | 15,146,925 | 0.066 |
| to_tsvector medium | ~630 B | 14,367,816 | 0.070 |
| to_tsvector large | ~6.3 KB | 8,771,930 | 0.114 |
| plainto_tsquery (1 word) | — | 15,964,240 | 0.063 |
| plainto_tsquery (3 words) | — | 16,425,756 | 0.061 |
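The two numeric columns are reciprocals of each other, and combining ops/s with the approximate document size gives an implied byte throughput that can be compared with the segmentation numbers earlier on this page. The helpers below are a hypothetical sketch of that arithmetic, assuming 1 MiB = 2^20 bytes; they are not part of the benchmark harness.

```rust
/// Microseconds per operation for a given operations-per-second rate.
fn micros_per_op(ops_per_sec: f64) -> f64 {
    1_000_000.0 / ops_per_sec
}

/// Implied throughput in MiB/s when each operation processes `doc_bytes` bytes.
/// Assumes 1 MiB = 1024 * 1024 bytes.
fn implied_mib_per_sec(doc_bytes: f64, ops_per_sec: f64) -> f64 {
    doc_bytes * ops_per_sec / (1024.0 * 1024.0)
}
```

For the small to_tsvector row, micros_per_op(15_146_925.0) rounds to the 0.066 µs shown in the table.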
Run locally: make -C kham-pg bench