| From: | Darkhan <darkhanahmetov2005(at)gmail(dot)com> |
|---|---|
| To: | Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info> |
| Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
| Subject: | Re: pg_kazsearch: Full-text search extension for Kazakh language |
| Date: | 2026-04-08 14:55:26 |
| Message-ID: | CAOW9cErZJAZQT+5icb8KpDdPPwxLv0q5TEKBV8pzv4pgPmwQQA@mail.gmail.com |
| Lists: | pgsql-general |
Thanks for the suggestion!
I did look into Snowball early on. Snowball already has a Turkish stemmer,
and Turkish is structurally very similar to Kazakh (both are agglutinative
Turkic languages). But honestly the Turkish one is pretty lobotomized: it
only handles nominal suffixes and doesn't account for verb morphology at
all. The author even mentions this in the comments. So it kind of works for
basic noun cases but falls apart on real text.
The reason I went with a standalone extension is that Kazakh has suffix
chains where vowel harmony interacts with each layer, so you need
context-aware decisions, not just stripping patterns from the end of the
word. My stemmer uses a penalty-scored breadth-first search (BFS) over
possible suffix decompositions instead of the linear step-by-step stripping
that Snowball does. With 5-6 suffixes stacked on one word you really need
to evaluate multiple decomposition paths to find the best one.
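To make the idea concrete, here is a minimal Rust sketch of a penalty-scored
BFS over suffix decompositions. It is an illustration only, not the code in
pg_kazsearch: the tiny suffix table, the penalty values, and the names
`SUFFIXES`, `last_vowel_harmony` and `best_stem` are all assumptions made up
for this example.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq)]
enum Harmony {
    Back,
    Front,
}

// Hypothetical toy suffix table: (suffix, harmony class, base penalty).
// A real stemmer needs far more suffixes and finer-grained rules.
const SUFFIXES: &[(&str, Harmony, u32)] = &[
    ("лар", Harmony::Back, 1),
    ("лер", Harmony::Front, 1),
    ("тар", Harmony::Back, 1),
    ("тер", Harmony::Front, 1),
    ("ымыз", Harmony::Back, 1),
    ("іміз", Harmony::Front, 1),
    ("дан", Harmony::Back, 1),
    ("ден", Harmony::Front, 1),
    ("да", Harmony::Back, 2),
    ("де", Harmony::Front, 2),
];

// Harmony class of the last vowel in `s`, if it has one.
fn last_vowel_harmony(s: &str) -> Option<Harmony> {
    s.chars().rev().find_map(|c| match c {
        'а' | 'о' | 'ұ' | 'ы' => Some(Harmony::Back),
        'ә' | 'е' | 'ө' | 'ү' | 'і' => Some(Harmony::Front),
        _ => None,
    })
}

// Breadth-first search over every way of peeling suffixes off the end.
// Score of a candidate = accumulated penalties + remaining stem length,
// so stripping pays off only when the suffix fits its context.
fn best_stem(word: &str) -> String {
    const MIN_STEM_CHARS: usize = 3; // assumed lower bound on stem length
    const HARMONY_PENALTY: u32 = 10; // assumed cost of a harmony violation
    const MAX_DEPTH: usize = 6; // at most six stacked suffixes

    let mut best = (word.to_string(), word.chars().count() as u32);
    let mut queue = VecDeque::from([(word.to_string(), 0u32, 0usize)]);

    while let Some((stem, penalty, depth)) = queue.pop_front() {
        let total = penalty + stem.chars().count() as u32;
        if total < best.1 {
            best = (stem.clone(), total);
        }
        if depth == MAX_DEPTH {
            continue;
        }
        for &(suffix, class, base) in SUFFIXES {
            if let Some(rest) = stem.strip_suffix(suffix) {
                if rest.chars().count() < MIN_STEM_CHARS {
                    continue;
                }
                // The suffix should agree with the last vowel before it.
                let mismatch = last_vowel_harmony(rest).map_or(false, |h| h != class);
                let step = base + if mismatch { HARMONY_PENALTY } else { 0 };
                queue.push_back((rest.to_string(), penalty + step, depth + 1));
            }
        }
    }
    best.0
}
```

With this toy table, `best_stem("балаларымызда")` peels off -да, -ымыз and
-лар and returns "бала", while a loanword like "легенда" is left intact
because stripping -да after a front-vowel stem costs more than it saves.
The toy still over-stems plenty of words; it only shows the shape of the
search.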
That said, contributing a simplified Kazakh stemmer to Snowball is
something I'd like to explore longer term. Even a basic version would be
better than nothing, which is what exists today. I would need to figure out
how much of the BFS logic can fit into the Snowball language, or whether a
simpler approach gets close enough.
Appreciate the pointer!
Darkhan
On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>
wrote:
> On 4/5/26 3:32 PM, Darkhan wrote:
> > Hi all,
> >
> > I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> > support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> > stop word list available in PostgreSQL, so anyone searching Kazakh text
> > is stuck with trigram matching or application-level workarounds.
> >
> > Kazakh is agglutinative: a single word can carry 5-6 suffixes, which
> > makes standard search approaches miss most relevant results.
> > pg_kazsearch provides a custom Kazakh stemmer (core written in Rust), a
> > stop word list, and a text search dictionary that plugs into the
> > standard PostgreSQL FTS infrastructure, so GIN indexes, ts_rank, and
> > phrase search all work out of the box.
> >
> > I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> > query, pg_kazsearch returns 61 relevant articles vs 1 with trigram
> > search, with a 23% improvement in recall overall.
> >
> > You can install it with a single command via deb package or Docker image,
> > no compilation needed.
> >
> > Repo: https://github.com/darkhanakh/pg-kazsearch
> >
> > I'd appreciate any feedback, especially from anyone working on text
> > search internals or with experience supporting non-Latin or
> > agglutinative languages in PostgreSQL.
> >
> > Thanks, Darkhan
> >
>
> Hello,
>
> Thanks for your work.
> I don't know anything about Kazakh.
>
> But have you tried adding it to the Snowball stemmer [1]?
> Since Postgres uses it, that would improve the chances of Kazakh being
> supported in future versions.
>
>
> 1: https://github.com/snowballstem/snowball
>
> --
> Adrien NAYRAT
> https://pro.anayrat.info
>