[PROPOSAL] Improvements of Hunspell dictionaries support

From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PROPOSAL] Improvements of Hunspell dictionaries support
Date: 2015-10-20 14:00:40
Message-ID: 56264908.2020203@postgrespro.ru
Lists: pgsql-hackers

Hi.

Introduction
============

PostgreSQL's full-text search uses dictionaries from various open source
spell checkers to perform word normalization.

Currently, Ispell, MySpell and Hunspell dictionaries are supported.

A dictionary requires two files: a dictionary file and an affix file. The
dictionary file contains a list of words, and each word may be followed
by one or more affix flags. The affix file contains parameters and the
definitions of the prefix and suffix classes that the dictionary file
refers to.
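
As an illustration of the dictionary file format, here is a minimal
Python sketch (not PostgreSQL code; the function name split_entry and the
sample entry are made up for this example). It assumes the usual
"word/flags" layout and ignores escaped slashes and any extra fields
after the word.

```python
def split_entry(line):
    """Split one dictionary line into (word, raw_flag_string).

    Assumes the 'word/flags' layout; a word without flags yields an
    empty flag string.
    """
    word, _, flags = line.strip().partition("/")
    return word, flags


print(split_entry("walk/DGS"))
print(split_entry("hello"))
```

How the raw flag string is then split into individual flags depends on
the flag type declared in the affix file.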

The most complete and actively developed are the Hunspell dictionaries
(http://hunspell.sourceforge.net/). The OpenOffice and LibreOffice
projects recently switched from MySpell to Hunspell dictionaries.

However, PostgreSQL is unable to load recent versions of the Hunspell
dictionaries for several languages.

This is because the affix files of these dictionaries have grown too big.
Traditionally, each affix rule is named by a single extended ASCII
(8-bit) symbol. If there are more than 192 rules, some syntax extension
is needed.

To handle such dictionaries, Hunspell has the FLAG parameter, with the
following values:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65000)
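
To make the difference between the flag types concrete, here is a
minimal Python sketch of how a raw flag string could be split under each
type. It is not the PostgreSQL parser; the function name decode_flags
and the mode names "char", "long" and "num" are assumptions for this
example, and it assumes that numeric flags are separated by commas.

```python
def decode_flags(flags, mode="char"):
    """Split a raw flag string into individual flags for one flag type."""
    if mode == "char":
        # Default type: every single character is one flag.
        return list(flags)
    if mode == "long":
        # FLAG long: every pair of characters is one flag.
        if len(flags) % 2:
            raise ValueError("FLAG long needs an even number of characters")
        return [flags[i:i + 2] for i in range(0, len(flags), 2)]
    if mode == "num":
        # FLAG num: decimal numbers separated by commas, from 1 to 65000.
        numbers = [int(part) for part in flags.split(",")] if flags else []
        for n in numbers:
            if not 1 <= n <= 65000:
                raise ValueError(f"numeric flag out of range: {n}")
        return numbers
    raise ValueError(f"unknown flag type: {mode}")


print(decode_flags("AB"))
print(decode_flags("AaBb", "long"))
print(decode_flags("14,203", "num"))
```

The same four characters "AaBb" are four flags under the default type
but only two under FLAG long, which is why the parser has to know the
flag type before it reads any flags.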

These flag types are used in the affix files of dictionaries such as ar,
br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is,
ne_np, nl_nl, si_lk (from
http://cgit.freedesktop.org/libreoffice/dictionaries/tree/). But
PostgreSQL does not support the FLAG parameter and cannot load these
dictionaries.

There is also the AF parameter, which allows affix flag sets to be
replaced with ordinal numbers in affix and dictionary files.
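
A minimal Python sketch of this substitution follows. It is not
PostgreSQL code; the function names read_aliases and resolve_flags and
the sample lines are made up for this example. It assumes that the first
AF line gives the number of aliases, that each following AF line defines
one flag set, and that aliases are numbered from 1.

```python
def read_aliases(affix_lines):
    """Collect the flag sets defined by AF lines, in file order."""
    values = []
    for line in affix_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "AF":
            values.append(fields[1])
    # The first AF line holds the alias count, not a flag set.
    return values[1:]


def resolve_flags(flag_field, aliases):
    """Replace an ordinal number with the flag set it stands for."""
    if aliases and flag_field.isdigit():
        return aliases[int(flag_field) - 1]  # aliases are 1-based
    return flag_field


aliases = read_aliases(["AF 2", "AF AB", "AF CD"])
print(resolve_flags("2", aliases))
```

With these sample lines, a dictionary entry such as "word/2" would carry
the flag set "CD".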

Neither FLAG nor AF is supported by PostgreSQL. Supporting these
parameters would allow the dictionaries listed above to be loaded into a
PostgreSQL database and used in full-text search.

Proposed Changes
================

The internal representation of a dictionary in PostgreSQL doesn't impose
overly strict limits on the number of affix rules. There is a flagval
array, whose size must be increased from 256 to 65000.

All other changes are in the affix file parsing code, so that it
properly parses long and numeric flags.

I've already implemented support for FLAG long; it requires a relatively
small patch (60 lines). Support for FLAG num would require a comparable
amount of code.

These changes would make it possible to use recent versions of the
following Hunspell dictionaries:
br_fr, ca, ca_valencia, da_dk, gl_es, is, ne_np, nl_nl, si_lk.

Implementing the AF parameter would also make it possible to support the
following dictionaries:
ar, en_ca, en_gb, en_us, fr, hu_hu.

Expected Results
================

These changes would make it possible to use more recent and complete
spelling dictionaries for word stemming during full-text indexing.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
