Performance

Throughput and accuracy figures for kham v0.8.2, measured on Apple M-series (arm64). All benchmarks were run with --release and LTO enabled.

Environment

CPU	Apple M-series (arm64)
OS	macOS
Rust	1.85+ stable, LTO enabled
Profile	release
Built-in dictionary	62,102 words · 670,460 DARTS states · 5.1 MiB
TNC frequency table	106,125 entries

Run it yourself: cargo bench -p kham-core. The HTML report is written to target/criterion/report/index.html

Accuracy

Word-boundary-level Precision / Recall / F1 against a CC0 gold corpus (kham-core/testdata/), plus a comparison with PyThaiNLP word_tokenize(engine='newmm') on 39 sentences.

Micro F1: 1.000 (against the CC0 gold corpus, 228 test cases)
Matching sentences: 94.9% (37/39 against PyThaiNLP newmm)
Genuine differences: the 2 remaining sentences are PyThaiNLP errors
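As a rough illustration of what a boundary-level micro score measures, the sketch below turns each tokenization into a set of byte offsets, counts matches and mismatches against the gold offsets, and sums those counts over all sentences before computing Precision, Recall and F1. It is not the code behind kham-bench-accuracy. The names `boundaries`, `Counts` and `ratio` are hypothetical, and both the choice to ignore the shared end-of-text boundary and the choice to score 0/0 as 1.0 are assumptions.

```rust
use std::collections::BTreeSet;

/// Byte offsets at which a token ends, excluding the end-of-text boundary
/// (assumption: it is shared by every segmentation, so it is not scored).
fn boundaries(tokens: &[&str]) -> BTreeSet<usize> {
    let mut set = BTreeSet::new();
    let mut offset = 0;
    for token in tokens {
        offset += token.len();
        set.insert(offset);
    }
    set.remove(&offset); // drop the trailing boundary
    set
}

/// Boundary counts accumulated over many sentences (micro-averaging).
#[derive(Default, Debug, Clone, Copy, PartialEq)]
struct Counts {
    true_pos: usize,
    false_pos: usize,
    false_neg: usize,
}

impl Counts {
    /// Add one sentence; `predicted` and `gold` must tokenize the same text.
    fn add(&mut self, predicted: &[&str], gold: &[&str]) {
        let p = boundaries(predicted);
        let g = boundaries(gold);
        self.true_pos += p.intersection(&g).count();
        self.false_pos += p.difference(&g).count();
        self.false_neg += g.difference(&p).count();
    }

    fn precision(&self) -> f64 {
        ratio(self.true_pos, self.true_pos + self.false_pos)
    }

    fn recall(&self) -> f64 {
        ratio(self.true_pos, self.true_pos + self.false_neg)
    }

    fn f1(&self) -> f64 {
        let (p, r) = (self.precision(), self.recall());
        if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
    }
}

/// Assumption: 0/0 counts as a perfect score (nothing expected, nothing predicted).
fn ratio(num: usize, den: usize) -> f64 {
    if den == 0 { 1.0 } else { num as f64 / den as f64 }
}
```

Calling `add` once per test case and reading `f1()` at the end gives a micro-averaged score. With this scheme a sentence that is segmented identically to the gold standard contributes only true positives.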

To run it yourself:

cargo run -p kham-bench-accuracy
cargo run -p kham-bench-accuracy -- --threshold 0.95   # CI gate
cargo run -p kham-bench-accuracy -- --verbose           # show failing cases

Segmentation throughput

Thai-only text, built-in dictionary, benchmark segment/by_length

Input size	Bytes	Time (median)	Throughput
Short	39 B	1.12 µs	33.2 MiB/s
Medium	180 B	5.09 µs	33.7 MiB/s
Long	540 B	14.95 µs	34.4 MiB/s
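The throughput column follows from the other two: input bytes divided by the median time, expressed in MiB (1,048,576 bytes). A minimal sketch of that conversion, with `mib_per_sec` as a hypothetical helper name:

```rust
/// Throughput in MiB/s for `bytes` processed in `seconds`.
fn mib_per_sec(bytes: usize, seconds: f64) -> f64 {
    const MIB: f64 = 1024.0 * 1024.0;
    bytes as f64 / seconds / MIB
}
```

For the short input, `mib_per_sec(39, 1.12e-6)` comes out at about 33.2, matching the table. The same formula applies to the other throughput tables on this page.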

Mixed-script text

Thai + English + digits in the same text, benchmark segment/mixed

Input	Bytes	Time (median)	Throughput
sparse (ธนาคาร100แห่ง)	33 B	948 ns	33.2 MiB/s
medium (many boundaries)	79 B	2.23 µs	33.8 MiB/s
dense (alternating scripts)	31 B	770 ns	38.4 MiB/s

Normalization

Unicode normalization (reordering floating vowel marks, removing duplicate tone marks, NFC), benchmark normalize/thai

Input size	Bytes	Time (median)	Throughput
Short	39 B	107.6 ns	345 MiB/s
Medium	180 B	285.9 ns	600 MiB/s
Long	540 B	743.7 ns	692 MiB/s
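To give a feel for one of the normalization steps listed above, here is a minimal sketch of duplicate tone mark removal. It is not kham's implementation: it assumes that "duplicate" means the same tone mark repeated back to back, and it leaves vowel reordering and NFC aside. The function names are hypothetical. The code points U+0E48 to U+0E4B are the four Thai tone marks.

```rust
/// Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..=U+0E4B).
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Collapse a run of the same tone mark into a single occurrence.
/// Assumption: only immediate repeats of an identical mark are duplicates.
fn dedupe_tone_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    for c in input.chars() {
        let repeated = is_tone_mark(c) && prev == Some(c);
        if !repeated {
            out.push(c);
        }
        prev = Some(c);
    }
    out
}
```

An input of ข followed by two U+0E49 marks and then าว comes back with a single U+0E49. Text without tone marks passes through unchanged, and two different tone marks in a row are left as they are.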

SQLite FTS5 Extension

Criterion benchmarks via rusqlite against an in-memory database with FTS5 enabled. Pipeline per xTokenize call: normalize → NE → stopword → POS → synonym → RTGS

Indexing throughput (INSERT)

Benchmark	Input	Size	Time (median)	Throughput
index/single/short	กินข้าวกับปลา	39 B	20.2 µs	1.84 MiB/s
index/single/medium	Thai text, ~60 chars	180 B	64.8 µs	2.65 MiB/s
index/single/long	3× medium	540 B	122.3 µs	4.21 MiB/s
index/batch_100/short	100 × short	3.9 KB	813 µs (8.1 µs/doc)	4.57 MiB/s
index/batch_100/medium	100 × medium	18.0 KB	3.67 ms (36.7 µs/doc)	4.68 MiB/s
index/batch_100/long	100 × long	54.0 KB	8.84 ms (88.4 µs/doc)	5.82 MiB/s
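The per-document figures in parentheses are the batch median divided by 100. Comparing them with the single-insert rows shows how much of the fixed per-INSERT cost a batch amortizes. A small sketch of that arithmetic, using hypothetical helper names:

```rust
/// Median time per document when `docs` documents are inserted as one batch.
fn per_doc_micros(batch_micros: f64, docs: u32) -> f64 {
    batch_micros / f64::from(docs)
}

/// How many times cheaper one document is inside a batch than as a single INSERT.
fn batch_speedup(single_micros: f64, batch_micros: f64, docs: u32) -> f64 {
    single_micros / per_doc_micros(batch_micros, docs)
}
```

For the short input, `batch_speedup(20.2, 813.0, 100)` is roughly 2.5. The ratio shrinks as documents get longer, because tokenization time grows with input size while the fixed per-INSERT overhead does not.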

Query latency

1,000-row table, medium-length text

Benchmark	Query	Result rows	Time (median)
query/single_word/thai_common	ข้าว	1,000	4.28 µs
query/single_word/thai_rare	ปลา	1,000	139.7 µs
query/single_word/number	100	0	2.00 µs
query/single_word/latin	hello	0	2.10 µs
query/snippet	ข้าว (LIMIT 10)	10	4.24 µs

PostgreSQL FTS Extension

Measured inside Docker (PostgreSQL 17, Linux ARM64) via make -C kham-pg bench. Pipeline per to_tsvector call: normalize → segment → NE tag → lk82 soundex → RTGS romanization

to_tsvector and plainto_tsquery throughput

Batch throughput via generate_series, rotating through 3 inputs per size to avoid PostgreSQL's function-result cache.

Operation	Document size	ops/s	µs/op
to_tsvector, small	~63 B	15,146,925	0.066
to_tsvector, medium	~630 B	14,367,816	0.070
to_tsvector, large	~6.3 KB	8,771,930	0.114
plainto_tsquery (1 word)	n/a	15,964,240	0.063
plainto_tsquery (3 words)	n/a	16,425,756	0.061

To run it yourself:

make -C kham-pg bench
