|From:||Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>|
|Subject:||[PROPOSAL] Improvements of Hunspell dictionaries support|
PostgreSQL full-text search uses dictionaries from various open source
spell checkers to perform word normalization. Currently, Ispell, MySpell
and Hunspell dictionaries are supported.
A dictionary requires two files: a dictionary file and an affix file. The
dictionary file contains a list of words, each of which may be followed
by one or more affix flags. The affix file contains parameters,
definitions, and the prefix and suffix classes used in the dictionary
file.
Hunspell dictionaries (http://hunspell.sourceforge.net/) are the most
complete and the most actively developed. The OpenOffice and LibreOffice
projects recently switched from MySpell to Hunspell dictionaries.
However, PostgreSQL is unable to load recent versions of the Hunspell
dictionaries for several languages, because the affix files of these
dictionaries have grown too big.
Traditionally, each affix rule is named by a single extended ASCII
(8-bit) symbol. If there are more than 192 rules, some syntax extension
is required. To handle such dictionaries, Hunspell has a FLAG parameter
with the following values:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65000)
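
To illustrate what the affix file parser has to do with these flag types,
here is a minimal sketch in Python (not the PostgreSQL C code). The
function name split_flags and the "char" label for the traditional
one-symbol mode are hypothetical; the comma separator for FLAG num follows
the Hunspell documentation as I understand it.

```python
def split_flags(field, flag_type="char"):
    """Split the flag field of a dictionary entry into individual flags.

    flag_type is a hypothetical label: "char" for the traditional
    one-symbol flags, "long" for FLAG long, "num" for FLAG num.
    """
    if flag_type == "char":
        # Traditional mode: every single character is one flag.
        return list(field)
    if flag_type == "long":
        # FLAG long: every pair of characters is one flag.
        if len(field) % 2 != 0:
            raise ValueError("long flags must come in pairs: %r" % field)
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        # FLAG num: decimal numbers from 1 to 65000, assumed comma-separated.
        flags = []
        for part in field.split(","):
            value = int(part)
            if not 1 <= value <= 65000:
                raise ValueError("numeric flag out of range: %d" % value)
            flags.append(str(value))
        return flags
    raise ValueError("unknown flag type: %r" % flag_type)
```

For example, the same four-character field "AaBb" yields four flags in the
traditional mode but only two flags ("Aa" and "Bb") under FLAG long.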
These flag types are used in affix files of such dictionaries as ar,
br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is,
ne_np, nl_nl, si_lk (from
PostgreSQL does not support the FLAG parameter and cannot load these
dictionaries.
There is also the AF parameter, which allows affix flag sets to be
replaced with ordinal numbers in affix and dictionary files.
Neither FLAG nor AF is currently supported by PostgreSQL. Supporting
these parameters would allow the dictionaries listed above to be loaded
into a PostgreSQL database and used in full-text search.
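
The AF substitution can be sketched as follows, again in Python rather
than in the PostgreSQL C code. The helper names parse_af_aliases and
resolve_entry are hypothetical, and the layout assumed here (a first
"AF <count>" line followed by one "AF <flags>" line per alias, numbered
from 1) is my reading of the Hunspell documentation, not something
PostgreSQL implements today.

```python
def parse_af_aliases(affix_lines):
    """Collect flag-set aliases from the AF lines of an affix file."""
    aliases = []
    seen_count = False
    for line in affix_lines:
        parts = line.split()
        if len(parts) < 2 or parts[0] != "AF":
            continue
        if not seen_count:
            # Assumed layout: the first AF line only gives the alias count.
            seen_count = True
            continue
        aliases.append(parts[1])
    return aliases


def resolve_entry(entry, aliases):
    """Return (word, flags), replacing an alias number by its flag set."""
    word, sep, flags = entry.partition("/")
    if sep and aliases and flags.isdigit():
        # Alias numbers are assumed to be 1-based.
        flags = aliases[int(flags) - 1]
    return word, flags
```

With the aliases "AB" and "C" defined, a dictionary entry such as
"hello/1" would be treated as "hello/AB".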
The internal representation of a dictionary in PostgreSQL does not
impose strict limits on the number of affix rules. There is a flagval
array, whose size must be increased from 256 to 65000.
All other changes are in the affix file parsing code, so that it
properly parses long and numeric flags.
I've already implemented support for FLAG long; it requires a relatively
small patch (60 lines). Support for FLAG num would require a comparable
amount of code.
These changes would make it possible to use recent versions of the
Hunspell dictionaries for the following languages:
br_fr, ca, ca_valencia, da_dk, gl_es, is, ne_np, nl_nl, si_lk.
Implementing the AF parameter would additionally allow support for the
following:
ar, en_ca, en_gb, en_us, fr, hu_hu.
These changes would make it possible to use more recent and complete
spelling dictionaries for word stemming during full-text indexing.
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company