| From: | Philip Johnston <philip(at)pgcache(dot)com> |
|---|---|
| To: | Darkhan <darkhanahmetov2005(at)gmail(dot)com> |
| Cc: | Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>, pgsql-general(at)lists(dot)postgresql(dot)org |
| Subject: | Re: pg_kazsearch: Full-text search extension for Kazakh language |
| Date: | 2026-04-10 15:24:38 |
| Message-ID: | CALLSp4xpGqx9jvwmCxrzKeG=UM+Se7c5LiwK-2CESC7iuXKhbQ@mail.gmail.com |
| Lists: | pgsql-general |
Darkhan,
Great work! As a former archaeologist, I was reminded by your comment about
Kazakh being agglutinative of ancient Sumerian, which has a similar
structure. Your work might find some interest among philologists and
historians of the ancient Near East.
Philip
On Wed, Apr 8, 2026 at 9:56 AM Darkhan <darkhanahmetov2005(at)gmail(dot)com> wrote:
> Thanks for the suggestion!
>
> I did look into Snowball early on. There is actually a Turkish stemmer in
> Snowball already and Turkish is structurally very similar to Kazakh (both
> agglutinative Turkic languages). But honestly the Turkish one is pretty
> lobotomized: it only handles nominal suffixes and doesn’t account for verb
> morphology at all. The author even mentions this in the comments. So it
> kind of works for basic noun cases but falls apart on real text.
>
> The reason I went with a standalone extension is that Kazakh has suffix
> chains where vowel harmony interacts with each layer and you need
> context-aware decisions, not just stripping patterns from the end of the
> word. My stemmer uses a penalty-scored BFS over possible suffix
> decompositions instead of the linear step-by-step stripping that Snowball
> does. With 5-6 suffixes stacked on one word you really need to evaluate
> multiple decomposition paths to find the best one.
>
> That said, contributing a simplified Kazakh stemmer to Snowball is
> something I’d like to explore longer term. Even a basic version would be
> better than nothing, which is what exists today. Would need to figure out
> how much of the BFS logic can fit into the Snowball language or if a
> simpler approach gets close enough.
>
> Appreciate the pointer!
>
> Darkhan
>
> On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>
> wrote:
>
>> On 4/5/26 3:32 PM, Darkhan wrote:
>> > Hi all,
>> >
>> > I built pg_kazsearch, a PostgreSQL extension that adds full-text search
>> > support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
>> > stop word list available in PostgreSQL, so anyone searching Kazakh text
>> > is stuck with trigram matching or application-level workarounds.
>> >
>> > Kazakh is agglutinative: a single word can carry 5-6 suffixes, which
>> > makes standard search approaches miss most relevant results.
>> > pg_kazsearch provides a custom Kazakh stemmer (core written in Rust),
>> > a stop word list, and a text search dictionary that plugs into the
>> > standard PostgreSQL FTS infrastructure: GIN indexes, ts_rank, phrase
>> > search all work out of the box.
>> >
>> > I tested it on a dataset of 3,000 real Kazakh news articles. On the same
>> > query, pg_kazsearch returns 61 relevant articles vs 1 with trigram
>> > search, with a 23% improvement in recall overall.
>> >
>> > You can install it with a single command via deb package or Docker
>> > image, no compilation needed.
>> >
>> > Repo: https://github.com/darkhanakh/pg-kazsearch
>> >
>> > I'd appreciate any feedback, especially from anyone working on text
>> > search internals or with experience supporting non-Latin or
>> > agglutinative languages in PostgreSQL.
>> >
>> > Thanks, Darkhan
>> >
>>
>> Hello,
>>
>> Thanks for your work.
>> I don't know anything about Kazakh.
>>
>> But have you tried adding it to the Snowball stemmer [1]?
>> Since Postgres uses it, you would have a better chance of getting Kazakh
>> supported in future versions.
>>
>>
>> 1: https://github.com/snowballstem/snowball
>>
>> --
>> Adrien NAYRAT
>> https://pro.anayrat.info
>>
>
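
The penalty-scored BFS over suffix decompositions described in Darkhan's
quoted reply can be illustrated with a minimal Python sketch. This is not
pg_kazsearch's code (its core is written in Rust). The suffix list, MIN_STEM,
HARMONY_PENALTY, and the scoring rule (remaining stem length plus a penalty
for each vowel-harmony violation) are all assumptions picked for illustration.

```python
from collections import deque

# Hypothetical, heavily reduced inventory: plural, 1st-person-plural
# possessive, and locative suffixes only. Real Kazakh has many more.
SUFFIXES = [
    "лар", "лер", "дар", "дер", "тар", "тер",
    "ымыз", "іміз", "мыз", "міз",
    "да", "де", "та", "те",
]
BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")
MIN_STEM = 3          # assumed minimum stem length, in characters
HARMONY_PENALTY = 5   # assumed cost of a vowel-harmony mismatch


def last_vowel_class(text):
    """Return 'back' or 'front' for the last classifiable vowel, else None."""
    for ch in reversed(text):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return None


def harmony_penalty(rest, suffix):
    """Penalize stripping a suffix whose vowels clash with the remaining stem."""
    a, b = last_vowel_class(rest), last_vowel_class(suffix)
    return HARMONY_PENALTY if a and b and a != b else 0


def stem(word):
    """Breadth-first search over suffix decompositions; lowest score wins."""
    word = word.lower()
    best_stem, best_score = word, len(word)
    cheapest = {word: 0}          # lowest accumulated penalty per candidate
    queue = deque([(word, 0)])
    while queue:
        current, penalty = queue.popleft()
        score = penalty + len(current)   # shorter stems score better
        if score < best_score:
            best_stem, best_score = current, score
        for suffix in SUFFIXES:
            if not current.endswith(suffix):
                continue
            rest = current[: -len(suffix)]
            if len(rest) < MIN_STEM:
                continue
            cost = penalty + harmony_penalty(rest, suffix)
            if cost < cheapest.get(rest, float("inf")):
                cheapest[rest] = cost
                queue.append((rest, cost))
    return best_stem


print(stem("балаларымызда"))    # бала
print(stem("мектептерімізде"))  # мектеп
print(stem("характер"))         # характер
```

In this sketch, "балаларымызда" has two competing paths after the locative is
removed: stripping "мыз" dead-ends at "балалары", while stripping "ымыз"
continues down to "бала", so the search keeps the second path. The loanword
"характер" ends in something that looks like the plural "тер", but the harmony
penalty makes the unstripped form score better, which a purely linear
end-stripping pass would not notice.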