SQLite FTS5
kham-sqlite is a loadable SQLite extension that provides a custom FTS5 tokenizer. It splits Thai text correctly so that FTS5 MATCH queries work on Thai documents.
Prerequisites
SQLite with FTS5 enabled and the ability to load extensions. On macOS, the system sqlite3 disables load_extension — install via Homebrew instead.
brew install sqlite
# Use /opt/homebrew/opt/sqlite/bin/sqlite3, not /usr/bin/sqlite3 Build the extension
# From the kham monorepo root
cargo build -p kham-sqlite --release
# The shared library is at:
# Linux: target/release/libkham_sqlite.so
# macOS: target/release/libkham_sqlite.dylib Load and create a FTS5 table
.load ./target/release/libkham_sqlite
CREATE VIRTUAL TABLE docs USING fts5(
title,
body,
tokenize = 'kham'
); Insert and search
INSERT INTO docs VALUES
('อาหารไทย', 'กินข้าวกับปลาและผัก'),
('เมืองหลวง', 'กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย');
-- Simple MATCH
SELECT title, body FROM docs WHERE docs MATCH 'ปลา';
-- BM25 ranking
SELECT title, bm25(docs) AS score
FROM docs
WHERE docs MATCH 'เมือง'
ORDER BY score;
-- Snippet highlight
SELECT snippet(docs, 1, '<b>', '</b>', '…', 10)
FROM docs
WHERE docs MATCH 'กรุงเทพ'; tokenize arguments
All arguments are optional and order-independent. File paths must be single-quoted inside the tokenize directive (FTS5 barewords only allow A–Z, 0–9, and _). Escape the quotes for SQL with '' (two single quotes).
-- lk82 soundex (default) — near-homophones share a code
CREATE VIRTUAL TABLE t1 USING fts5(body, tokenize='kham');
-- udom83 soundex (finer sibilant/liquid distinctions)
CREATE VIRTUAL TABLE t2 USING fts5(body, tokenize='kham soundex udom83');
-- Disable soundex entirely
CREATE VIRTUAL TABLE t3 USING fts5(body, tokenize='kham soundex none');
-- Suppress stopwords from the index
CREATE VIRTUAL TABLE t4 USING fts5(body, tokenize='kham stopwords on');
-- Bigrams for OOV/unknown tokens (default: trigrams; 0 = disabled)
CREATE VIRTUAL TABLE t5 USING fts5(body, tokenize='kham ngram_size 2');
-- Custom synonym map (path must be single-quoted)
CREATE VIRTUAL TABLE t6 USING fts5(body,
tokenize='kham synonyms ''/etc/kham/synonyms.tsv''');
-- Custom domain word list (overlaid on built-in dictionary, no trie rebuild)
CREATE VIRTUAL TABLE t7 USING fts5(body,
tokenize='kham dict ''/etc/kham/domain_words.txt''');
-- All options combined
CREATE VIRTUAL TABLE t8 USING fts5(body,
tokenize='kham soundex lk82 stopwords on ngram_size 2
dict ''/words.txt'' synonyms ''/syns.tsv'''); Ranking and field boosting
FTS5 provides BM25 scoring built-in. Lower (more negative) scores rank higher. Pass column weight arguments to boost title over body. Phonetic/soundex matches receive the same BM25 weight as exact token matches — to downweight them, rerank in application code after fetching FTS candidates.
-- Basic BM25 ranking (lower value = better match)
SELECT title, body, bm25(docs) AS score
FROM docs
WHERE docs MATCH 'ปลา'
ORDER BY score;
-- Field boosting: title column weight 10×, body column weight 1×
-- Arguments follow the column order in CREATE VIRTUAL TABLE
CREATE VIRTUAL TABLE docs USING fts5(title, body, tokenize='kham');
SELECT title, bm25(docs, 10.0, 1.0) AS score
FROM docs
WHERE docs MATCH 'ปลา'
ORDER BY score;
-- Exact match vs phonetic match: FTS5 weights them equally.
-- Application-level reranking for phonetic downweighting:
-- 1. Fetch candidates with FTS5 MATCH (fast, index-backed)
-- 2. In app code, tokenize the query and check if each matched token
-- is an exact dictionary word (edit distance 0)
-- 3. Multiply the BM25 score by 0.3 for phonetic-only matches