
Benchmarks

Throughput and accuracy numbers for kham v0.8.2, measured on Apple M-series (arm64). All criterion benchmarks run with --release and LTO enabled.

Environment

CPU: Apple M-series (arm64)
OS: macOS
Rust: 1.85+ stable, LTO enabled
Profile: release
Built-in dictionary: 62,102 words · 670,460 DARTS states · 5.1 MiB
TNC frequency table: 106,125 entries

Run locally with cargo bench -p kham-core; the HTML report is written to target/criterion/report/index.html.

Accuracy

Word-boundary precision, recall, and F1 are measured against a CC0 gold corpus (kham-core/testdata/). Output is also compared against PyThaiNLP word_tokenize(engine='newmm') on 39 benchmark sentences.

| Metric | Value | Notes |
| --- | --- | --- |
| Micro F1 | 1.000 | vs CC0 gold corpus (228 test cases) |
| Sentence agreement | 94.9% | 37/39 vs PyThaiNLP newmm |
| Genuine diffs | 0 | the 2 remaining disagreements are confirmed PyThaiNLP errors |

How to reproduce:
cargo run -p kham-bench-accuracy
cargo run -p kham-bench-accuracy -- --threshold 0.95   # CI gate
cargo run -p kham-bench-accuracy -- --verbose           # show failing cases
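
For readers unfamiliar with the metric, the sketch below shows one common way to compute micro-averaged word-boundary precision, recall, and F1. It is an illustration only, not the code in kham-bench-accuracy: the `boundaries` helper, the `Counts` struct, and the choice to count the end-of-text boundary are all assumptions made for this example.

```rust
use std::collections::HashSet;

/// Byte offsets at which each token ends, i.e. the word boundaries
/// implied by a segmentation (hypothetical helper for this sketch).
fn boundaries(tokens: &[&str]) -> HashSet<usize> {
    let mut offset = 0;
    tokens
        .iter()
        .map(|t| {
            offset += t.len();
            offset
        })
        .collect()
}

/// Running totals across test cases. Micro-averaging sums the raw
/// counts over all cases first and divides once at the end.
#[derive(Default)]
struct Counts {
    true_pos: usize,
    predicted: usize,
    gold: usize,
}

fn ratio(num: usize, den: usize) -> f64 {
    if den == 0 { 0.0 } else { num as f64 / den as f64 }
}

impl Counts {
    fn add_case(&mut self, gold: &[&str], predicted: &[&str]) {
        let g = boundaries(gold);
        let p = boundaries(predicted);
        self.true_pos += g.intersection(&p).count();
        self.predicted += p.len();
        self.gold += g.len();
    }

    fn precision(&self) -> f64 {
        ratio(self.true_pos, self.predicted)
    }

    fn recall(&self) -> f64 {
        ratio(self.true_pos, self.gold)
    }

    fn f1(&self) -> f64 {
        let (p, r) = (self.precision(), self.recall());
        if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
    }
}

fn main() {
    let mut counts = Counts::default();
    // Case 1: the prediction matches the gold segmentation exactly.
    counts.add_case(&["กิน", "ข้าว"], &["กิน", "ข้าว"]);
    // Case 2: the prediction misses the boundary between the two words.
    counts.add_case(&["กับ", "ปลา"], &["กับปลา"]);
    // Totals: 3 matched boundaries, 3 predicted, 4 gold.
    // Prints: precision=1.000 recall=0.750 f1=0.857
    println!(
        "precision={:.3} recall={:.3} f1={:.3}",
        counts.precision(),
        counts.recall(),
        counts.f1()
    );
}
```

Under this definition, a Micro F1 of 1.000 means every gold boundary was predicted and no extra boundary was produced across the whole corpus.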

Segmentation Throughput

Pure Thai input, built-in dictionary, criterion segment/by_length benchmark.

| Input | Size | Time (median) | Throughput |
| --- | --- | --- | --- |
| short | 39 B | 1.12 µs | 33.2 MiB/s |
| medium | 180 B | 5.09 µs | 33.7 MiB/s |
| long | 540 B | 14.95 µs | 34.4 MiB/s |
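
The throughput column can be recomputed from the size and median-time columns. The sketch below does that for the three rows above; `mib_per_sec` is a hypothetical helper written for this example, and it assumes MiB means 2^20 bytes, which is the interpretation that reproduces the published figures.

```rust
/// Throughput in MiB/s for `bytes` processed in `nanos` nanoseconds.
/// Hypothetical helper for this example; 1 MiB = 1024 * 1024 bytes.
fn mib_per_sec(bytes: u64, nanos: f64) -> f64 {
    let bytes_per_sec = bytes as f64 / (nanos * 1e-9);
    bytes_per_sec / (1024.0 * 1024.0)
}

fn main() {
    // (label, size in bytes, median time in ns) from the segment/by_length table.
    let rows = [
        ("short", 39, 1_120.0),
        ("medium", 180, 5_090.0),
        ("long", 540, 14_950.0),
    ];
    for (label, bytes, nanos) in rows {
        // Prints 33.2, 33.7 and 34.4 MiB/s respectively.
        println!("{label}: {:.1} MiB/s", mib_per_sec(bytes, nanos));
    }
}
```

The same formula applies to every Throughput column on this page that is reported in MiB/s.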

Mixed-script

Thai + Latin + Number in the same input — segment/mixed.

| Input | Size | Time (median) | Throughput |
| --- | --- | --- | --- |
| sparse (ธนาคาร100แห่ง) | 33 B | 948 ns | 33.2 MiB/s |
| medium (multi-boundary) | 79 B | 2.23 µs | 33.8 MiB/s |
| dense (alternating script) | 31 B | 770 ns | 38.4 MiB/s |

Normalization

Unicode normalization (สระลอย reordering, วรรณยุกต์ deduplication, NFC) — normalize/thai.

| Input | Size | Time (median) | Throughput |
| --- | --- | --- | --- |
| short | 39 B | 107.6 ns | 345 MiB/s |
| medium | 180 B | 285.9 ns | 600 MiB/s |
| long | 540 B | 743.7 ns | 692 MiB/s |
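
To give a feel for one of the steps named above, here is a minimal sketch of tone-mark (วรรณยุกต์) deduplication. It is not kham's normalizer: `is_tone_mark` and `dedup_tone_marks` are hypothetical names, and the rule shown (drop a tone mark that immediately repeats the same tone mark) is an assumption about what deduplication means here.

```rust
/// True for the four Thai tone marks, U+0E48 (mai ek) through
/// U+0E4B (mai chattawa).
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Drop a tone mark when it immediately repeats the previous character.
/// All other characters, including repeated letters, pass through.
fn dedup_tone_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    for c in input.chars() {
        if is_tone_mark(c) && prev == Some(c) {
            continue; // same tone mark typed twice in a row
        }
        out.push(c);
        prev = Some(c);
    }
    out
}

fn main() {
    // "ข้าว" typed with a doubled mai tho (U+0E49).
    let noisy = "ข\u{0E49}\u{0E49}าว";
    let clean = dedup_tone_marks(noisy);
    println!("{} bytes -> {} bytes", noisy.len(), clean.len());
}
```

A single pass over the characters with no lookups is consistent with normalization running roughly an order of magnitude faster than segmentation in the tables on this page.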

Dictionary

Construction

| Operation | Time (median) | Notes |
| --- | --- | --- |
| builtin_dict() (binary blob load) | 92.9 µs | pay-once startup cost |
| Tokenizer::new() (full startup) | 33.4 ms | dict + freq + NE + POS tables |
| FtsTokenizer::new() | 46.7 ms | adds synonym + RTGS + soundex tables |
| Dict::from_word_list (62k words) | 1.63 s | only when merging a custom dict |
| Dict::from_word_list (8-word list) | 5.0 µs | small custom dict |

builtin_dict() is ~17,500× faster than Dict::from_word_list on the 62k-word list (92.9 µs vs 1.63 s) because the DARTS trie is pre-compiled by build.rs at compile time; the runtime cost is a single O(S) binary decode pass.
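
The speedup figure follows directly from the two medians reported for builtin_dict() and the 62k-word Dict::from_word_list build:

```latex
\frac{t_{\text{from\_word\_list}}}{t_{\text{builtin\_dict}}}
  = \frac{1.63\ \text{s}}{92.9\ \mu\text{s}}
  = \frac{1.63 \times 10^{6}\ \mu\text{s}}{92.9\ \mu\text{s}}
  \approx 17{,}546 \approx 17{,}500
```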

Lookup

| Operation | Time (median) | Throughput |
| --- | --- | --- |
| contains (hit, 9-byte word กิน) | 11.1 ns | 770 MiB/s |
| contains (hit, 18-byte word สวัสดี) | 29.2 ns | 587 MiB/s |
| contains (miss, ASCII non-word) | 1.22 ns | ~4.5 GiB/s |
| prefixes (short anchor, 21 B) | 63.8 ns | 314 MiB/s |
| prefixes (medium anchor, 60 B) | 55.9 ns | 1.0 GiB/s |
| prefixes (long anchor, 99 B) | 100.0 ns | 944 MiB/s |

SQLite FTS5 Extension

Criterion benchmarks via rusqlite with bundled SQLite (FTS5 enabled), in-memory database. Pipeline per xTokenize call: normalize → NE tag → stopword → POS → synonym expand → RTGS romanization.

Indexing — INSERT throughput

| Benchmark | Input | Size | Time (median) | Throughput |
| --- | --- | --- | --- | --- |
| index/single/short | กินข้าวกับปลา | 39 B | 20.2 µs | 1.84 MiB/s |
| index/single/medium | Thai prose ~60 chars | 180 B | 64.8 µs | 2.65 MiB/s |
| index/single/long | 3× medium | 540 B | 122.3 µs | 4.21 MiB/s |
| index/single/mixed | Thai + Latin + Number | 79 B | 41.8 µs | 1.80 MiB/s |
| index/batch_100/short | 100 × short | 3.9 KB | 813 µs (8.1 µs/doc) | 4.57 MiB/s |
| index/batch_100/medium | 100 × medium | 18.0 KB | 3.67 ms (36.7 µs/doc) | 4.68 MiB/s |
| index/batch_100/long | 100 × long | 54.0 KB | 8.84 ms (88.4 µs/doc) | 5.82 MiB/s |

Query latency

Table pre-populated with 1,000 rows of medium input.

| Benchmark | Query | Result rows | Time (median) |
| --- | --- | --- | --- |
| query/single_word/thai_common | ข้าว | 1,000 | 4.28 µs |
| query/single_word/thai_rare | ปลา | 1,000 | 139.7 µs |
| query/single_word/number | 100 | 0 | 2.00 µs |
| query/single_word/latin | hello | 0 | 2.10 µs |
| query/snippet | ข้าว (LIMIT 10 snippets) | 10 | 4.24 µs |

PostgreSQL FTS Extension

Measured inside Docker (PostgreSQL 17, Linux ARM64) via make -C kham-pg bench. Pipeline per to_tsvector call: normalize → segment → NE tag → lk82 soundex → RTGS romanization.

to_tsvector & plainto_tsquery throughput

Batch throughput is measured via generate_series with 3 input variants per size, cycled to defeat PostgreSQL's function-result cache.

| Operation | Doc size | ops/s | µs/op |
| --- | --- | --- | --- |
| to_tsvector small | ~63 B | 15,146,925 | 0.066 |
| to_tsvector medium | ~630 B | 14,367,816 | 0.070 |
| to_tsvector large | ~6.3 KB | 8,771,930 | 0.114 |
| plainto_tsquery (1 word) | | 15,964,240 | 0.063 |
| plainto_tsquery (3 words) | | 16,425,756 | 0.061 |
How to reproduce:
make -C kham-pg bench
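
The ops/s and µs/op columns in the PostgreSQL table appear to be reciprocals of each other. Assuming that is how they were derived, the conversion, checked against the first and third rows, is:

```latex
\mu\text{s/op} = \frac{10^{6}}{\text{ops/s}},
\qquad
\frac{10^{6}}{15{,}146{,}925} \approx 0.066,
\qquad
\frac{10^{6}}{8{,}771{,}930} \approx 0.114
```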