Skip to main content
PostgreSQL 14–18 · SQLite FTS5

Thai Full-Text Search that Actually Works

Thai text has no spaces. Standard FTS engines produce garbage results. kham replaces the parser with a proper Thai segmenter — phonetic matching, stopwords, and POS filtering included.

The problem with standard FTS for Thai

Thai text has no whitespace between words. Standard FTS parsers — including PostgreSQL's built-in parser and SQLite's default tokenizer — split text on whitespace. On Thai text, this means the entire sentence becomes a single un-searchable token.

to_tsvector with built-in parser
'กินข้าวกับปลา':1
← single token, cannot search "ปลา"

How kham solves it

kham replaces the parser with a proper Thai segmenter. Each word is a separate, searchable token, enriched with phonetic codes and POS tags.

to_tsvector with kham parser
'กิน':1 'ข้าว':2 'ปลา':4
'4800':4 'pla':4 'pos_noun':4
← word tokens + soundex + RTGS + POS

Search features

Every feature works in both PostgreSQL and SQLite.

🔊

Phonetic matching

Thai soundex codes (lk82 / udom83 / MetaSound) stored alongside each word — near-homophones and misspellings match automatically.

🚫

Stopword suppression

600+ Thai stopwords suppressed from the index. Common words like กับ, และ, ใน, ที่ do not pollute result ranking.

🏷️

POS lexeme filtering

Part-of-speech tags are stored as colocated lexemes — filter results to nouns, verbs, or named entities in one tsquery.

🌐

RTGS romanization

Latin-script queries match Thai documents via RTGS romanization. Search "pla" and find ปลา automatically.

🔢

Number normalization

Thai digit strings (๑๒๓) indexed alongside ASCII equivalents (123) — cross-script numeric queries always match.

High throughput

PostgreSQL: 15M+ ops/s. SQLite FTS5: 4–6 MiB/s bulk index throughput. Both benchmarked on Apple M-series.

🐘

PostgreSQL — native text search parser

kham-pg is a PostgreSQL text-search parser extension. It integrates with to_tsvector, ts_headline, and ts_rank natively. Supports PostgreSQL 14–18 with pre-built Docker images for fastest setup.

# PostgreSQL 17 with kham_pg pre-installed — no Rust toolchain needed
# Available tags: v0.8.2-pg14  v0.8.2-pg16  v0.8.2-pg17  v0.8.2-pg18
docker run --rm \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  nickmsft/kham-pg:v0.8.2-pg17

# Connect and use immediately
psql -h localhost -U postgres -c "
  CREATE EXTENSION kham_pg;
  SELECT to_tsvector('kham', 'กินข้าวกับปลา');
"
-- 1. Create the extension
CREATE EXTENSION kham_pg;

-- 2. Create a table with a tsvector column
CREATE TABLE articles (
  id    SERIAL PRIMARY KEY,
  title TEXT,
  body  TEXT,
  tsv   TSVECTOR
);

-- 3. Populate the index
UPDATE articles
SET tsv = to_tsvector('kham', coalesce(title,'') || ' ' || coalesce(body,''));

-- 4. GIN index for fast queries
CREATE INDEX articles_tsv_idx ON articles USING GIN(tsv);

-- 5. Search — phonetic match "pla" finds ปลา
SELECT id, ts_headline('kham', body, query, 'StartSel=<b>, StopSel=</b>') AS snippet
FROM articles, to_tsquery('kham', 'pla') AS query
WHERE tsv @@ query
ORDER BY ts_rank(tsv, query) DESC
LIMIT 10;
Full PostgreSQL integration guide →
🗃️

SQLite — offline-first Thai search

kham-sqlite is a loadable FTS5 tokenizer extension. Ideal for mobile apps, desktop apps, and edge deployments where there is no database server. Runs on Android, iOS, macOS, Linux, and Windows.

-- 1. Load the extension
.load ./target/release/libkham_sqlite

-- 2. Create an FTS5 virtual table
CREATE VIRTUAL TABLE docs USING fts5(
  title,
  body,
  tokenize = 'kham soundex lk82 stopwords on'
);

-- 3. Insert documents
INSERT INTO docs VALUES
  ('อาหารไทย', 'กินข้าวกับปลาและผัก'),
  ('เมืองหลวง', 'กรุงเทพมหานครเป็นเมืองหลวง');

-- 4. Search — BM25 ranked results
SELECT title, bm25(docs) AS score
FROM docs
WHERE docs MATCH 'ปลา'
ORDER BY score;

-- 5. Snippet highlighting
SELECT title, snippet(docs, 1, '<b>', '</b>', '…', 10)
FROM docs
WHERE docs MATCH 'กรุงเทพ';
Full SQLite integration guide →

Performance

15M+ ops/s
PostgreSQL to_tsvector
~63 B document, Linux ARM64
4–6 MiB/s
SQLite bulk index
in-memory, FTS5 full pipeline
33 MiB/s
Segmentation
pure Thai, Apple M-series

Full benchmark results →

Add Thai search to your app today

Free, open-source, MIT OR Apache-2.0. Docker images for zero-setup PostgreSQL — no Rust toolchain required.