Benchmarks

Throughput and accuracy numbers for kham v0.5.0, measured on Apple M-series (arm64). All criterion benchmarks run with --release and LTO enabled.

Environment

| Item | Value |
|---|---|
| CPU | Apple M-series (arm64) |
| OS | macOS |
| Rust | 1.85+ stable, LTO enabled |
| Profile | release |
| Built-in dictionary | 62,102 words · 669,387 DARTS states · 5.1 MiB |
| TNC frequency table | 106,125 entries |

Run locally: cargo bench -p kham-core — HTML report at target/criterion/report/index.html

Accuracy

Word-boundary precision / recall / F1 against a CC0 gold corpus (kham-core/testdata/). Also compared against PyThaiNLP word_tokenize(engine='newmm') on 39 benchmark sentences.

| Metric | Value | Notes |
|---|---|---|
| Micro F1 | 0.975 | vs CC0 gold corpus |
| Sentence agreement | 94.9% | 37/39 vs PyThaiNLP newmm |
| Genuine diffs | 0 | the 2 remaining disagreements are confirmed PyThaiNLP errors |

How to reproduce:
cargo run -p kham-bench-accuracy
cargo run -p kham-bench-accuracy -- --threshold 0.95   # CI gate
cargo run -p kham-bench-accuracy -- --verbose           # show failing cases
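
For readers unfamiliar with the metric, the sketch below shows one common way to compute micro-averaged word-boundary precision, recall, and F1. It is not the kham-bench-accuracy implementation. The function names are hypothetical, and it assumes boundaries are compared as byte offsets of token ends, with counts pooled across all sentences and the sentence-final boundary included.

```rust
use std::collections::BTreeSet;

/// Byte offsets at which each token ends; these are the word boundaries.
/// The final offset (end of sentence) is included, which is an assumption.
fn boundaries(tokens: &[&str]) -> BTreeSet<usize> {
    let mut offset = 0usize;
    tokens
        .iter()
        .map(|t| {
            offset += t.len();
            offset
        })
        .collect()
}

/// Micro-averaged (precision, recall, F1) over (gold, predicted) sentence pairs.
/// Counts are pooled over all sentences before the ratios are taken.
fn micro_prf(pairs: &[(Vec<&str>, Vec<&str>)]) -> (f64, f64, f64) {
    let (mut tp, mut n_pred, mut n_gold) = (0usize, 0usize, 0usize);
    for (gold, pred) in pairs {
        let g = boundaries(gold);
        let p = boundaries(pred);
        tp += g.intersection(&p).count(); // boundaries both sides agree on
        n_pred += p.len();
        n_gold += g.len();
    }
    if tp == 0 {
        return (0.0, 0.0, 0.0); // avoids division by zero on empty or disjoint input
    }
    let precision = tp as f64 / n_pred as f64;
    let recall = tp as f64 / n_gold as f64;
    let f1 = 2.0 * precision * recall / (precision + recall);
    (precision, recall, f1)
}
```

With gold tokens `["ab", "cd"]` and predicted tokens `["ab", "c", "d"]`, two of the three predicted boundaries are correct and both gold boundaries are found, so precision is 2/3, recall is 1, and F1 is 0.8.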

Segmentation Throughput

Pure Thai input, built-in dictionary, criterion segment/by_length benchmark.

| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 879 ns | 42.3 MiB/s |
| medium | 182 B | 3.80 µs | 45.1 MiB/s |
| long | 546 B | 10.9 µs | 47.1 MiB/s |

Mixed-script

Thai + Latin + Number in the same input — segment/mixed.

| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| sparse (ธนาคาร100แห่ง) | 26 B | 744 ns | 42.3 MiB/s |
| medium (multi-boundary) | 74 B | 1.73 µs | 43.5 MiB/s |
| dense (alternating script) | 29 B | 535 ns | 55.3 MiB/s |
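
To make "mixed-script" concrete, the sketch below splits an input into runs of the same script class, which is the kind of boundary the sparse and dense inputs exercise. It only illustrates the idea and is not kham's segmenter. The `Script` enum and both function names are hypothetical, and the classification is deliberately coarse (Thai block U+0E00 to U+0E7F, ASCII letters, ASCII digits).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Thai,
    Latin,
    Number,
    Other,
}

/// Coarse script classification; a real tokenizer would cover more ranges.
fn classify(c: char) -> Script {
    match c {
        '\u{0E00}'..='\u{0E7F}' => Script::Thai,
        'a'..='z' | 'A'..='Z' => Script::Latin,
        '0'..='9' => Script::Number,
        _ => Script::Other,
    }
}

/// Splits `text` into maximal runs of characters sharing one script class.
fn script_runs(text: &str) -> Vec<(Script, &str)> {
    let mut runs = Vec::new();
    let mut start = 0; // byte offset where the current run began
    let mut current: Option<Script> = None;
    for (i, c) in text.char_indices() {
        let s = classify(c);
        match current {
            Some(prev) if prev == s => {} // same script, run continues
            Some(prev) => {
                runs.push((prev, &text[start..i]));
                start = i;
                current = Some(s);
            }
            None => current = Some(s),
        }
    }
    if let Some(prev) = current {
        runs.push((prev, &text[start..])); // flush the last run
    }
    runs
}
```

For the sparse input, `script_runs("ธนาคาร100แห่ง")` yields three runs: Thai `ธนาคาร`, Number `100`, Thai `แห่ง`. Dictionary-based segmentation would then only need to run inside the Thai runs.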

Normalization

Unicode normalization (สระลอย reordering, วรรณยุกต์ deduplication, NFC) — normalize/thai.

| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 79.9 ns | 465 MiB/s |
| medium | 182 B | 199 ns | 864 MiB/s |
| long | 546 B | 507 ns | 1.0 GiB/s |
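
As an illustration of one of the normalization steps named above, the sketch below removes repeated tone marks (วรรณยุกต์). It is an assumption about what "deduplication" means here, namely dropping a tone mark that immediately repeats the previous character. It is not kham's implementation, and it does not cover สระลอย reordering or NFC. The function names are hypothetical.

```rust
/// Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..=U+0E4B).
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Drops a tone mark that immediately repeats the previous character.
/// Runs in one pass over the input, O(n) in the number of characters.
fn dedup_tone_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    for c in input.chars() {
        if is_tone_mark(c) && prev == Some(c) {
            continue; // duplicate tone mark, skip it
        }
        out.push(c);
        prev = Some(c);
    }
    out
}
```

A doubled mai tho, as in `"ข\u{0E49}\u{0E49}าว"`, collapses to the single-mark form `"ข\u{0E49}าว"`, while text without repeated marks passes through unchanged.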

Dictionary

Construction

| Operation | Time (median) | Notes |
|---|---|---|
| builtin_dict() (binary blob load) | 78 µs | pay-once startup cost |
| Dict::from_word_list (62k words) | 980 ms | only when merging a custom dict |
| Dict::from_word_list (8-word list) | 3.72 µs | small custom dict |
| dict/file/read_and_build (disk + build) | 1.01 s | kham --dict <file> startup |

builtin_dict() is roughly 12,500× faster than building the same 62k-word dictionary with Dict::from_word_list (78 µs vs 980 ms), because build.rs pre-compiles the DARTS trie at compile time. The only runtime cost is a single O(S) binary decode pass over the stored states.
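
The sketch below illustrates why a decode pass is so much cheaper than construction: loading is a linear copy of fixed-size records, with no trie insertion. The blob layout is an assumption, not kham's documented format. It uses the classic double-array (base, check) pair stored as two little-endian u32 values per state, and `State`, `encode`, and `decode` are hypothetical names. At 8 bytes per state, 669,387 states come to about 5.1 MiB, which is consistent with the figure in the environment table but does not confirm the layout.

```rust
/// One double-array trie state. The (base, check) layout is the classic
/// double-array representation; the real kham blob format may differ.
#[derive(Debug, PartialEq)]
struct State {
    base: u32,
    check: u32,
}

/// Build-time side: serializes states as consecutive little-endian u32 pairs.
fn encode(states: &[State]) -> Vec<u8> {
    let mut blob = Vec::with_capacity(states.len() * 8);
    for s in states {
        blob.extend_from_slice(&s.base.to_le_bytes());
        blob.extend_from_slice(&s.check.to_le_bytes());
    }
    blob
}

/// Runtime side: decodes the blob in one linear pass, O(S) for S states.
/// Returns None if the blob length is not a whole number of 8-byte records.
fn decode(blob: &[u8]) -> Option<Vec<State>> {
    if blob.len() % 8 != 0 {
        return None; // truncated or corrupt blob
    }
    let states = blob
        .chunks_exact(8)
        .map(|chunk| State {
            base: u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]),
            check: u32::from_le_bytes([chunk[4], chunk[5], chunk[6], chunk[7]]),
        })
        .collect();
    Some(states)
}
```

In a real crate the encoded bytes would typically be written by build.rs and embedded with `include_bytes!`, so startup only pays for `decode`.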

Lookup

| Operation | Time (median) | Throughput |
|---|---|---|
| contains, hit (9-byte word กิน) | 7.1 ns | 1.18 GiB/s |
| contains, hit (18-byte word สวัสดี) | 18.3 ns | 940 MiB/s |
| contains, miss (ASCII non-word) | 744 ps | 7.5–8.8 GiB/s |
| prefixes, short anchor (7 B) | 42.3 ns | 473 MiB/s |
| prefixes, medium anchor (60 B) | 36.7 ns | 1.52 GiB/s |
| prefixes, long anchor (97 B) | 74.5 ns | 1.24 GiB/s |
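
The throughput columns can be read as input bytes divided by time per call, expressed in binary units (1 MiB = 2^20 bytes, 1 GiB = 2^30 bytes). The helper below is a back-of-envelope sketch under that assumption, with hypothetical function names. Individual rows may differ from it where the listed size or time has been rounded.

```rust
const MIB: f64 = 1024.0 * 1024.0;
const GIB: f64 = 1024.0 * MIB;

/// Bytes processed per second, given input size in bytes and time per call in ns.
fn bytes_per_sec(size_bytes: f64, time_ns: f64) -> f64 {
    size_bytes / (time_ns * 1e-9)
}

/// Formats a rate in GiB/s when it reaches 1 GiB/s, otherwise in MiB/s.
fn format_throughput(bps: f64) -> String {
    if bps >= GIB {
        format!("{:.2} GiB/s", bps / GIB)
    } else {
        format!("{:.1} MiB/s", bps / MIB)
    }
}
```

For the medium prefixes anchor, `format_throughput(bytes_per_sec(60.0, 36.7))` gives `1.52 GiB/s`, matching the table.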

SQLite FTS5 Extension

Criterion benchmarks via rusqlite with bundled SQLite (FTS5 enabled), in-memory database. Pipeline per xTokenize call: normalize → NE tag → stopword → POS → synonym expand → RTGS romanization.

Indexing — INSERT throughput

| Benchmark | Input | Size | Time (median) | Throughput |
|---|---|---|---|---|
| index/single/short | กินข้าวกับปลา | 21 B | 15.5 µs | 2.47 MiB/s |
| index/single/medium | ~63 B Thai prose | 63 B | 41.8 µs | 4.14 MiB/s |
| index/single/long | 3× medium | 189 B | 94.3 µs | 5.46 MiB/s |
| index/single/mixed | Thai + Latin + Number | 37 B | 32.4 µs | 2.32 MiB/s |
| index/batch_100/short | 100 × short | 2.1 KB | 640 µs (6.4 µs/doc) | 6.0 MiB/s |
| index/batch_100/medium | 100 × medium | 6.3 KB | 2.54 ms (25.4 µs/doc) | 7.1 MiB/s |
| index/batch_100/long | 100 × long | 18.9 KB | 6.75 ms (67.5 µs/doc) | 7.6 MiB/s |

Query latency

Table pre-populated with 1,000 rows of medium input.

| Benchmark | Query | Result rows | Time (median) |
|---|---|---|---|
| query/single_word/thai_common | ข้าว | 1,000 | 88.3 µs |
| query/single_word/thai_rare | ปลา | 1,000 | 88.9 µs |
| query/single_word/number | 100 | 0 | 1.4 µs |
| query/single_word/latin | hello | 0 | 1.5 µs |
| query/snippet | ข้าว (top 10 snippets) | 10 | 417 µs |