WebAssembly / npm

kham-wasm brings Thai word segmentation to the browser and Node.js with zero server dependencies. The WASM binary is ~300 KB and includes the full dictionary.

1

Install

npm install kham-wasm
# or: yarn add kham-wasm  /  pnpm add kham-wasm

2

Browser — ES module

Import the default export init and await it once to load the .wasm file. After that, segment() and segment_tokens() can be called synchronously as often as needed.

import init, { segment, segment_tokens } from 'kham-wasm';

// Load the WebAssembly module (fetches kham_wasm_bg.wasm)
await init();

// Segment into strings
const words = segment("กินข้าวกับปลา");
console.log(words);
// ["กิน", "ข้าว", "กับ", "ปลา"]

// Rich tokens
const tokens = segment_tokens("Hello กรุงเทพ 2024");
tokens.forEach(t => {
  console.log(t.text, t.kind, t.char_start, t.char_end);
});
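
The kind field makes it straightforward to drop whitespace and punctuation before further processing. The following is a minimal sketch, not part of kham-wasm: the Token interface is a hand-written subset of the fields listed under Token fields, wordsOnly is a hypothetical helper, and the sample array is a stand-in for real segment_tokens() output.

```typescript
// Hand-written subset of the token shape (assumption: field names and kind
// values as documented for segment_tokens()).
interface Token {
  text: string;
  kind: string;
}

// Hypothetical helper: keep every token except whitespace and punctuation.
function wordsOnly(tokens: Token[]): string[] {
  return tokens
    .filter((t) => t.kind !== "Whitespace" && t.kind !== "Punctuation")
    .map((t) => t.text);
}

// Stand-in for segment_tokens("Hello กรุงเทพ 2024"); the tokens the library
// actually returns may differ.
const sample: Token[] = [
  { text: "Hello", kind: "Latin" },
  { text: " ", kind: "Whitespace" },
  { text: "กรุงเทพ", kind: "Thai" },
  { text: " ", kind: "Whitespace" },
  { text: "2024", kind: "Number" },
];

console.log(wordsOnly(sample)); // ["Hello", "กรุงเทพ", "2024"]
```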

3

Lazy loading (recommended)

Load the WASM in the background so it does not block the initial page render.

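// Cache the in-flight promise so concurrent callers share a single init().
let khamPromise: Promise<typeof import('kham-wasm')> | null = null;

function getKham() {
  if (!khamPromise) {
    khamPromise = import('kham-wasm').then(async (mod) => {
      await mod.default();     // init: fetches and instantiates the .wasm
      return mod;              // the module namespace exposes segment_tokens() etc.
    });
  }
  return khamPromise;
}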

// Preload silently in the background
getKham().catch(console.error);

async function onSegment(text: string) {
  const { segment_tokens } = await getKham();
  return segment_tokens(text);
}

4

Token fields

// Token object returned by segment_tokens()
tok.text        // string — the token text
tok.kind        // string — "Thai" | "Latin" | "Number" | "Punctuation" | "Emoji" | "Whitespace" | "Unknown"
tok.char_start  // number — Unicode scalar-value start (JS string index for BMP text)
tok.char_end    // number — Unicode scalar-value end
tok.byte_start  // number — UTF-8 byte start offset
tok.byte_end    // number — UTF-8 byte end offset
tok.confidence  // number — 0.0 (Unknown) … 1.0 (high-confidence dict match)
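
Because char_start and char_end count Unicode scalar values, they stop matching JavaScript string indices (UTF-16 code units) as soon as the text contains a character outside the BMP, such as most emoji. The sketch below shows one way to convert between the two. Both helpers are hypothetical and not part of kham-wasm, and the code assumes half-open [start, end) offsets, which this page does not state explicitly.

```typescript
// Hypothetical helper: convert a Unicode scalar-value offset into a UTF-16
// code-unit index that String.prototype.slice understands.
function scalarToUtf16Index(text: string, scalarOffset: number): number {
  let index = 0;
  for (let seen = 0; seen < scalarOffset && index < text.length; seen++) {
    const hi = text.charCodeAt(index);
    const lo = index + 1 < text.length ? text.charCodeAt(index + 1) : 0;
    // A high surrogate followed by a low surrogate encodes one scalar value.
    const isPair = hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff;
    index += isPair ? 2 : 1;
  }
  return index;
}

// Slice the original string by scalar offsets (assumed half-open: [start, end)).
function sliceByScalars(text: string, charStart: number, charEnd: number): string {
  return text.slice(
    scalarToUtf16Index(text, charStart),
    scalarToUtf16Index(text, charEnd),
  );
}

const text = "😀กิน";                    // the emoji occupies two UTF-16 code units
console.log(sliceByScalars(text, 1, 4)); // "กิน"
```

For text that stays entirely within the BMP, which covers ordinary Thai and Latin script, the conversion is the identity and tok.char_start can be passed to slice() directly.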

5

Spell checking

spell_suggestions(word, maxN) returns up to maxN dictionary candidates within edit distance 2, ranked by LK82 (Thai Soundex) phonetic similarity and Thai National Corpus (TNC) frequency.

import init, { spell_suggestions } from 'kham-wasm';
await init();

const suggs = spell_suggestions("กีนข้าว", 5);
for (const s of suggs) {
  console.log(s.word, 'edit:', s.edit_distance,
              'soundex:', s.soundex_match, 'freq:', s.freq_score);
}
// กินข้าว  edit: 1  soundex: true  freq: …

// SpellSuggestion: .word  .edit_distance  .soundex_match  .freq_score
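
To make the edit_distance field concrete, here is a sketch of the classic Levenshtein distance, which counts single-character insertions, deletions and substitutions. This page does not say whether kham uses plain Levenshtein or a variant, nor what unit it counts in, so treat editDistance as an illustrative stand-in rather than the library's implementation. It compares UTF-16 code units, which coincide with characters for BMP Thai text.

```typescript
// Illustrative Levenshtein distance over UTF-16 code units (assumption: this
// approximates, but is not necessarily identical to, kham's edit_distance).
function editDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 units of a and the first j of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// One vowel mark differs (ี vs ิ), so the distance is 1.
console.log(editDistance("กีนข้าว", "กินข้าว")); // 1
```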

6

Keyword extraction

extract_keywords(text, maxN) returns up to maxN of the most distinctive words, scored by term frequency (TF) × inverse corpus frequency. Stopwords are excluded.

import init, { extract_keywords } from 'kham-wasm';
await init();

const text = "นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ " +
             "ดาวดวงนี้โคจรอยู่ใกล้ดาวเคราะห์น้อย";

const keywords = extract_keywords(text, 5);
for (const kw of keywords) {
  console.log(kw.word, 'score:', kw.score.toFixed(4), 'count:', kw.count);
}

// Keyword: .word  .score  .count
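
The idea behind the score can be shown with a toy re-implementation. This is not kham's code, and its exact formula is not given on this page. The sketch reads "TF × inverse corpus frequency" as tf × ln(1 / corpus frequency), which is one common interpretation. The scoreKeywords function, the corpus frequencies and the stopword list are all made-up illustration values.

```typescript
interface ScoredWord {
  word: string;
  score: number;
  count: number;
}

// Toy scorer (hypothetical): words frequent in this text but rare in the
// reference corpus score highest.
function scoreKeywords(
  words: string[],
  corpusFreq: Record<string, number>, // assumed relative frequencies in (0, 1]
  stopwords: string[],
  maxN: number,
): ScoredWord[] {
  const counts: Record<string, number> = Object.create(null);
  for (let i = 0; i < words.length; i++) {
    const w = words[i];
    if (stopwords.indexOf(w) !== -1) continue; // stopwords are excluded
    counts[w] = (counts[w] || 0) + 1;
  }

  const scored: ScoredWord[] = [];
  for (const word in counts) {
    const tf = counts[word] / words.length;
    // Words missing from the toy corpus are treated as very rare.
    const freq = corpusFreq[word] !== undefined ? corpusFreq[word] : 1e-6;
    scored.push({ word, score: tf * Math.log(1 / freq), count: counts[word] });
  }

  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, maxN);
}

// Pre-segmented toy input with made-up corpus frequencies.
const ranked = scoreKeywords(
  ["ดาว", "ใน", "ดาว", "ระบบ"],
  { "ดาว": 0.01, "ระบบ": 0.1 },
  ["ใน"],
  5,
);
console.log(ranked.map((k) => k.word)); // ["ดาว", "ระบบ"]
```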

7

Build from source

# Requires a Rust toolchain; the next line installs wasm-pack
curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh

git clone https://github.com/preedep/kham
cd kham
wasm-pack build kham-wasm --target web --release
# Output: kham-wasm/pkg/