Skip to main content
🗃️

SQLite FTS5

kham-sqlite is a loadable SQLite extension that provides a custom FTS5 tokenizer. It splits Thai text correctly so that FTS5 MATCH queries work on Thai documents.

1

Prerequisites

SQLite with FTS5 enabled and the ability to load extensions. On macOS, the system sqlite3 disables load_extension — install via Homebrew instead.

brew install sqlite
# Use /opt/homebrew/opt/sqlite/bin/sqlite3, not /usr/bin/sqlite3
2

Build the extension

# From the kham monorepo root
cargo build -p kham-sqlite --release

# The shared library is at:
#   Linux:  target/release/libkham_sqlite.so
#   macOS:  target/release/libkham_sqlite.dylib
3

Load and create a FTS5 table

.load ./target/release/libkham_sqlite

CREATE VIRTUAL TABLE docs USING fts5(
  title,
  body,
  tokenize = 'kham'
);
4

Insert and search

INSERT INTO docs VALUES
  ('อาหารไทย',  'กินข้าวกับปลาและผัก'),
  ('เมืองหลวง', 'กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย');

-- Simple MATCH
SELECT title, body FROM docs WHERE docs MATCH 'ปลา';

-- BM25 ranking
SELECT title, bm25(docs) AS score
FROM docs
WHERE docs MATCH 'เมือง'
ORDER BY score;

-- Snippet highlight
SELECT snippet(docs, 1, '<b>', '</b>', '…', 10)
FROM docs
WHERE docs MATCH 'กรุงเทพ';
5

tokenize arguments

All arguments are optional and order-independent. File paths must be single-quoted inside the tokenize directive (FTS5 barewords only allow A–Z, 0–9, and _). Escape the quotes for SQL with '' (two single quotes).

-- lk82 soundex (default) — near-homophones share a code
CREATE VIRTUAL TABLE t1 USING fts5(body, tokenize='kham');

-- udom83 soundex (finer sibilant/liquid distinctions)
CREATE VIRTUAL TABLE t2 USING fts5(body, tokenize='kham soundex udom83');

-- Disable soundex entirely
CREATE VIRTUAL TABLE t3 USING fts5(body, tokenize='kham soundex none');

-- Suppress stopwords from the index
CREATE VIRTUAL TABLE t4 USING fts5(body, tokenize='kham stopwords on');

-- Bigrams for OOV/unknown tokens (default: trigrams; 0 = disabled)
CREATE VIRTUAL TABLE t5 USING fts5(body, tokenize='kham ngram_size 2');

-- Custom synonym map (path must be single-quoted)
CREATE VIRTUAL TABLE t6 USING fts5(body,
  tokenize='kham synonyms ''/etc/kham/synonyms.tsv''');

-- Custom domain word list (overlaid on built-in dictionary, no trie rebuild)
CREATE VIRTUAL TABLE t7 USING fts5(body,
  tokenize='kham dict ''/etc/kham/domain_words.txt''');

-- All options combined
CREATE VIRTUAL TABLE t8 USING fts5(body,
  tokenize='kham soundex lk82 stopwords on ngram_size 2
            dict ''/words.txt'' synonyms ''/syns.tsv''');
6

Ranking and field boosting

FTS5 provides BM25 scoring built-in. Lower (more negative) scores rank higher. Pass column weight arguments to boost title over body. Phonetic/soundex matches receive the same BM25 weight as exact token matches — to downweight them, rerank in application code after fetching FTS candidates.

-- Basic BM25 ranking (lower value = better match)
SELECT title, body, bm25(docs) AS score
FROM docs
WHERE docs MATCH 'ปลา'
ORDER BY score;

-- Field boosting: title column weight 10×, body column weight 1×
-- Arguments follow the column order in CREATE VIRTUAL TABLE
CREATE VIRTUAL TABLE docs USING fts5(title, body, tokenize='kham');

SELECT title, bm25(docs, 10.0, 1.0) AS score
FROM docs
WHERE docs MATCH 'ปลา'
ORDER BY score;

-- Exact match vs phonetic match: FTS5 weights them equally.
-- Application-level reranking for phonetic downweighting:
--   1. Fetch candidates with FTS5 MATCH (fast, index-backed)
--   2. In app code, tokenize the query and check if each matched token
--      is an exact dictionary word (edit distance 0)
--   3. Multiply the BM25 score by 0.3 for phonetic-only matches