Re: pg_kazsearch: Full-text search extension for Kazakh language

From: Darkhan <darkhanahmetov2005(at)gmail(dot)com>
To: Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: pg_kazsearch: Full-text search extension for Kazakh language
Date: 2026-04-08 14:55:26
Message-ID: CAOW9cErZJAZQT+5icb8KpDdPPwxLv0q5TEKBV8pzv4pgPmwQQA@mail.gmail.com
Lists: pgsql-general

Thanks for the suggestion!

I did look into Snowball early on. There is actually a Turkish stemmer in
Snowball already, and Turkish is structurally very similar to Kazakh (both
are agglutinative Turkic languages). But honestly the Turkish one is pretty
limited: it only handles nominal suffixes and doesn’t account for verb
morphology at all. The author even mentions this in the comments. So it
kind of works for basic noun cases but falls apart on real text.

The reason I went with a standalone extension is that Kazakh has suffix
chains where vowel harmony interacts with each layer and you need
context-aware decisions, not just stripping patterns from the end of the
word. My stemmer uses a penalty-scored BFS over possible suffix
decompositions instead of the linear step-by-step stripping that Snowball
does. With 5-6 suffixes stacked on one word you really need to evaluate
multiple decomposition paths to find the best one.
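
As a rough illustration of what a penalty-scored BFS over suffix
decompositions can look like, here is a minimal sketch in Python. It is not
the pg_kazsearch implementation (whose core is in Rust): the suffix table is
a small hand-picked subset, and the penalty values, MIN_STEM, and the scoring
rule (remaining stem length plus accumulated penalties) are assumptions made
up for this example.

```python
from collections import deque

# Harmonic vowel classes used for a simplified back/front harmony check.
BACK, FRONT = set("аоұы"), set("әеіөү")

# Toy suffix table (hypothetical penalties): suffix -> cost of stripping it.
SUFFIXES = {
    "лар": 0.5, "лер": 0.5, "дар": 0.5, "дер": 0.5, "тар": 0.5, "тер": 0.5,
    "ымыз": 0.5, "іміз": 0.5,
    "да": 0.5, "де": 0.5, "та": 0.5, "те": 0.5,
}
MIN_STEM = 3            # assumed: never strip below this many characters
HARMONY_PENALTY = 5.0   # assumed: cost of a suffix that breaks vowel harmony


def harmony(text):
    """Return 'back' or 'front' for the last harmonic vowel, else None."""
    for ch in reversed(text):
        if ch in BACK:
            return "back"
        if ch in FRONT:
            return "front"
    return None


def stem(word):
    """Explore suffix decompositions breadth-first, keep the cheapest stem."""
    word = word.lower()
    best_stem, best_score = word, float(len(word))
    queue = deque([(word, 0.0)])   # (candidate stem, accumulated penalty)
    seen = {}                      # lowest penalty seen per candidate stem
    while queue:
        current, penalty = queue.popleft()
        score = len(current) + penalty
        if score < best_score:
            best_stem, best_score = current, score
        for suffix, cost in SUFFIXES.items():
            if not current.endswith(suffix):
                continue
            rest = current[: -len(suffix)]
            if len(rest) < MIN_STEM:
                continue
            new_penalty = penalty + cost
            rest_class, suffix_class = harmony(rest), harmony(suffix)
            if rest_class and suffix_class and rest_class != suffix_class:
                new_penalty += HARMONY_PENALTY
            if new_penalty < seen.get(rest, float("inf")):
                seen[rest] = new_penalty
                queue.append((rest, new_penalty))
    return best_stem


print(stem("кітаптарымызда"))   # кітап
print(stem("мектептерімізде"))  # мектеп
print(stem("кітаптер"))         # кітаптер (harmony mismatch, left unstripped)
```

The last call shows the context-aware part: the front-vowel suffix "тер"
after the back-vowel stem "кітап" picks up the harmony penalty, so the
unstripped form scores better than the stripped one.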

That said, contributing a simplified Kazakh stemmer to Snowball is something
I’d like to explore longer term. Even a basic version would be better than
nothing, which is what exists today. I would need to figure out how much of
the BFS logic can fit into the Snowball language, or whether a simpler
approach gets close enough.

Appreciate the pointer!

Darkhan

On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>
wrote:

> On 4/5/26 3:32 PM, Darkhan wrote:
> > Hi all,
> >
> > I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> > support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> > stop word list available in PostgreSQL, so anyone searching Kazakh text
> > is stuck with trigram matching or application-level workarounds.
> >
> > Kazakh is agglutinative: a single word can carry 5-6 suffixes, which
> > makes standard search approaches miss most relevant results. pg_kazsearch
> > provides a custom Kazakh stemmer (core written in Rust), a stop word
> > list, and a text search dictionary that plugs into the standard
> > PostgreSQL FTS infrastructure: GIN indexes, ts_rank, phrase search all
> > work out of the box.
> >
> > I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> > query, pg_kazsearch returns 61 relevant articles vs 1 with trigram
> > search, with a 23% improvement in recall overall.
> >
> > You can install it with a single command via deb package or Docker image,
> > no compilation needed.
> >
> > Repo: https://github.com/darkhanakh/pg-kazsearch
> >
> > I'd appreciate any feedback, especially from anyone working on text
> > search internals or with experience supporting non-Latin or
> > agglutinative languages in PostgreSQL.
> >
> > Thanks, Darkhan
> >
>
> Hello,
>
> Thanks for your work.
> I don't know anything about Kazakh.
>
> But have you tried adding it to the Snowball stemmer [1]?
> As Postgres uses it, you would have a better chance of getting Kazakh
> supported in future versions.
>
>
> 1: https://github.com/snowballstem/snowball
>
> --
> Adrien NAYRAT
> https://pro.anayrat.info
>
