Benchmarks

Throughput and accuracy numbers for kham v0.8.2, measured on Apple M-series (arm64). All criterion benchmarks run with --release and LTO enabled.

Environment

CPU	Apple M-series (arm64)
OS	macOS
Rust	1.85+ stable, LTO enabled
Profile	release
Built-in dictionary	62,102 words · 670,460 DARTS states · 5.1 MiB
TNC frequency table	106,125 entries

Run locally: cargo bench -p kham-core. Criterion writes the HTML report to target/criterion/report/index.html.

Accuracy

Word-boundary precision, recall, and F1 are measured against a CC0 gold corpus (kham-core/testdata/). Output is also compared against PyThaiNLP word_tokenize(engine='newmm') on 39 benchmark sentences.

Micro F1: 1.000 (vs CC0 gold corpus, 228 test cases)
Sentence agreement: 94.9% (37/39 vs PyThaiNLP newmm)
Genuine diffs: the 2 remaining disagreements are confirmed PyThaiNLP errors

How to reproduce:

cargo run -p kham-bench-accuracy
cargo run -p kham-bench-accuracy -- --threshold 0.95   # CI gate
cargo run -p kham-bench-accuracy -- --verbose           # show failing cases
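
One common way to compute word-boundary precision, recall, and F1 is to turn each tokenization into the set of character offsets where a token ends, then count the offsets shared by the prediction and the gold standard. The following is a minimal sketch under that assumption. The helper names `boundaries` and `micro_prf` are hypothetical and are not taken from kham-bench-accuracy, whose actual implementation may differ.

```rust
use std::collections::BTreeSet;

/// End offsets (in chars) of every token.
/// Hypothetical helper for illustration, not part of kham's API.
fn boundaries(tokens: &[&str]) -> BTreeSet<usize> {
    let mut pos = 0;
    tokens
        .iter()
        .map(|t| {
            pos += t.chars().count();
            pos
        })
        .collect()
}

/// Micro-averaged precision, recall, and F1 over (predicted, gold) pairs:
/// counts are summed across all cases before the ratios are taken.
fn micro_prf(cases: &[(Vec<&str>, Vec<&str>)]) -> (f64, f64, f64) {
    let (mut tp, mut n_pred, mut n_gold) = (0usize, 0usize, 0usize);
    for (pred, gold) in cases {
        let p = boundaries(pred);
        let g = boundaries(gold);
        tp += p.intersection(&g).count();
        n_pred += p.len();
        n_gold += g.len();
    }
    let precision = tp as f64 / n_pred as f64;
    let recall = tp as f64 / n_gold as f64;
    let f1 = 2.0 * precision * recall / (precision + recall);
    (precision, recall, f1)
}

fn main() {
    let cases = vec![
        // Exact match with the gold segmentation.
        (vec!["กิน", "ข้าว", "กับ", "ปลา"], vec!["กิน", "ข้าว", "กับ", "ปลา"]),
        // Under-segmented: the boundary after "กิน" is missing.
        (vec!["กินข้าว", "กับ", "ปลา"], vec!["กิน", "ข้าว", "กับ", "ปลา"]),
    ];
    let (p, r, f1) = micro_prf(&cases);
    println!("precision={p:.3} recall={r:.3} f1={f1:.3}");
}
```

In this toy input all 7 predicted boundaries are correct, but only 7 of the 8 gold boundaries are found, so precision is 1.0 and recall is 0.875. A micro F1 of 1.000 requires every predicted boundary to match the gold corpus in every test case.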

Segmentation Throughput

Pure Thai input, built-in dictionary, criterion segment/by_length benchmark.

Input	Size	Time (median)	Throughput
short	39 B	1.12 µs	33.2 MiB/s
medium	180 B	5.09 µs	33.7 MiB/s
long	540 B	14.95 µs	34.4 MiB/s
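
The Throughput column can be cross-checked from the other two columns: it is the input size divided by the median time, expressed in binary mebibytes (1 MiB = 1,048,576 bytes). A small sketch of that arithmetic follows. The function name `throughput_mib_s` is made up for this example.

```rust
/// Throughput in MiB/s for `bytes` processed in `nanos` nanoseconds.
fn throughput_mib_s(bytes: u64, nanos: f64) -> f64 {
    let seconds = nanos * 1e-9;
    bytes as f64 / seconds / (1024.0 * 1024.0)
}

fn main() {
    // (label, size in bytes, median time in ns) copied from the table above.
    let rows = [
        ("short", 39, 1_120.0),
        ("medium", 180, 5_090.0),
        ("long", 540, 14_950.0),
    ];
    for (label, bytes, nanos) in rows {
        println!("{label}: {:.1} MiB/s", throughput_mib_s(bytes, nanos));
    }
}
```

The same conversion applies to the mixed-script, normalization, lookup, and indexing tables.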

Mixed-script

Thai + Latin + Number in the same input — segment/mixed.

Input	Size	Time (median)	Throughput
sparse (ธนาคาร100แห่ง)	33 B	948 ns	33.2 MiB/s
medium (multi-boundary)	79 B	2.23 µs	33.8 MiB/s
dense (alternating script)	31 B	770 ns	38.4 MiB/s

Normalization

Unicode normalization: สระลอย (floating vowel) reordering, วรรณยุกต์ (tone mark) deduplication, and NFC. Criterion normalize/thai benchmark.

Input	Size	Time (median)	Throughput
short	39 B	107.6 ns	345 MiB/s
medium	180 B	285.9 ns	600 MiB/s
long	540 B	743.7 ns	692 MiB/s

Dictionary

Construction

Operation	Time (median)	Notes
builtin_dict() — binary blob load	92.9 µs	pay-once startup cost
Tokenizer::new() — full startup	33.4 ms	dict + freq + NE + POS tables
FtsTokenizer::new()	46.7 ms	adds synonym + RTGS + soundex tables
Dict::from_word_list — 62k words	1.63 s	only when merging a custom dict
Dict::from_word_list — 8-word list	5.0 µs	small custom dict

builtin_dict() is ~17,500× faster than Dict::from_word_list on the same 62k words (92.9 µs vs 1.63 s) because the DARTS trie is pre-compiled by build.rs at compile time; the runtime cost is a single O(S) binary decode pass.
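
To illustrate why loading a pre-compiled table is cheap, here is a sketch of a single linear decode pass over a byte blob. The layout (a flat array of little-endian u32 values) and the name `decode_states` are assumptions made for this example. kham's actual blob format is not described on this page and may differ.

```rust
/// Decode a little-endian u32 array in one linear pass.
/// The layout is an assumption for illustration, not kham's actual blob format.
fn decode_states(blob: &[u8]) -> Option<Vec<u32>> {
    if blob.len() % 4 != 0 {
        return None; // truncated or corrupt blob
    }
    Some(
        blob.chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

fn main() {
    // A real crate might embed build.rs output with `include_bytes!`;
    // here the blob is built in memory so the example is self-contained.
    let blob: Vec<u8> = [1u32, 2, 670_460]
        .iter()
        .flat_map(|s| s.to_le_bytes())
        .collect();
    let states = decode_states(&blob).expect("blob length must be a multiple of 4");
    println!("{} states decoded", states.len());
}
```

Each byte is visited once and no trie insertion or array relocation happens at runtime. That work is what makes building from a word list far more expensive.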

Lookup

Operation	Time (median)	Throughput
contains — hit (9-byte word กิน)	11.1 ns	770 MiB/s
contains — hit (18-byte word สวัสดี)	29.2 ns	587 MiB/s
contains — miss (ASCII non-word)	1.22 ns	~4.5 GiB/s
prefixes — short anchor (21 B)	63.8 ns	314 MiB/s
prefixes — medium anchor (60 B)	55.9 ns	1.0 GiB/s
prefixes — long anchor (99 B)	100.0 ns	944 MiB/s
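
Sizes throughout this page are UTF-8 byte counts, not character counts. Every code point in the Thai Unicode block encodes to 3 bytes, which is why the 3-character word กิน is listed as 9 bytes. The sketch below checks this with the standard library only. `byte_and_char_len` is a helper invented for this example.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    for word in ["กิน", "สวัสดี", "กินข้าวกับปลา", "ธนาคาร100แห่ง"] {
        let (bytes, chars) = byte_and_char_len(word);
        println!("{word}: {bytes} bytes, {chars} chars");
    }
}
```

Mixed-script input such as ธนาคาร100แห่ง is smaller per character because ASCII digits take one byte each.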

SQLite FTS5 Extension

Criterion benchmarks via rusqlite with bundled SQLite (FTS5 enabled), in-memory database. Pipeline per xTokenize call: normalize → NE tag → stopword → POS → synonym expand → RTGS romanization.

Indexing — INSERT throughput

Benchmark	Input	Size	Time (median)	Throughput
index/single/short	กินข้าวกับปลา	39 B	20.2 µs	1.84 MiB/s
index/single/medium	Thai prose ~60 chars	180 B	64.8 µs	2.65 MiB/s
index/single/long	3× medium	540 B	122.3 µs	4.21 MiB/s
index/single/mixed	Thai + Latin + Number	79 B	41.8 µs	1.80 MiB/s
index/batch_100/short	100 × short	3.9 KB	813 µs (8.1 µs/doc)	4.57 MiB/s
index/batch_100/medium	100 × medium	18.0 KB	3.67 ms (36.7 µs/doc)	4.68 MiB/s
index/batch_100/long	100 × long	54.0 KB	8.84 ms (88.4 µs/doc)	5.82 MiB/s
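
The per-document figures in parentheses are the batch time divided by 100. Comparing them with the single-insert rows shows how much fixed per-statement overhead batching amortizes. A short sketch of that comparison, using only numbers from the table above (`per_doc_us` is a name chosen for this example):

```rust
/// Amortized per-document cost of a batch, in microseconds.
fn per_doc_us(batch_us: f64, docs: u32) -> f64 {
    batch_us / docs as f64
}

fn main() {
    // (label, single-insert µs, batch-of-100 total µs) from the table above.
    let rows = [
        ("short", 20.2, 813.0),
        ("medium", 64.8, 3_670.0),
        ("long", 122.3, 8_840.0),
    ];
    for (label, single_us, batch_us) in rows {
        let amortized = per_doc_us(batch_us, 100);
        println!(
            "{label}: {amortized:.1} µs/doc, {:.2}x vs single insert",
            single_us / amortized
        );
    }
}
```

The gain shrinks as documents grow (roughly 2.5× for short, 1.8× for medium, 1.4× for long), which suggests that tokenization cost, rather than per-statement overhead, dominates for longer inputs.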

Query latency

Table pre-populated with 1,000 rows of medium input.

Benchmark	Query	Result rows	Time (median)
query/single_word/thai_common	ข้าว	1,000	4.28 µs
query/single_word/thai_rare	ปลา	1,000	139.7 µs
query/single_word/number	100	0	2.00 µs
query/single_word/latin	hello	0	2.10 µs
query/snippet	ข้าว (LIMIT 10 snippets)	10	4.24 µs

PostgreSQL FTS Extension

Measured inside Docker (PostgreSQL 17, Linux ARM64) via make -C kham-pg bench. Pipeline per to_tsvector call: normalize → segment → NE tag → lk82 soundex → RTGS romanization.

to_tsvector & plainto_tsquery throughput

Batch throughput is measured via generate_series with 3 input variants per size, cycled to defeat PostgreSQL's function-result cache.

Operation	Doc size	ops/s	µs/op
to_tsvector small	~63 B	15,146,925	0.066
to_tsvector medium	~630 B	14,367,816	0.070
to_tsvector large	~6.3 KB	8,771,930	0.114
plainto_tsquery (1 word)	—	15,964,240	0.063
plainto_tsquery (3 words)	—	16,425,756	0.061

How to reproduce:

make -C kham-pg bench
