Re: [PROPOSAL] Shared Ispell dictionaries

From:	Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PROPOSAL] Shared Ispell dictionaries
Date:	2017-12-31 15:28:13
Message-ID:	20171231152811.GA4233@arthur.localdomain
Lists:	pgsql-hackers

Hello, hackers,

On Tue, Dec 26, 2017 at 07:48:27PM +0300, Arthur Zakirov wrote:
> The patch will be ready and added into the 2018-03 commitfest.
>

I have attached the patch set itself.

0001-Fix-ispell-memory-handling.patch:

Some strings are allocated via compact_palloc0(), but they are not
persistent, so they should be allocated in a temporary memory context.
Also, a couple of strings are not released if the .aff file uses the new format.

0002-Retreive-shmem-location-for-ispell.patch:

Adds the ispell_shmem_location() function, which looks up the location of a
dictionary using its .dict and .aff file names. If the location has not been
allocated in DSM earlier, the function allocates it. A shared hash table is
used to search for the location.

The maximum number of elements in the hash table is currently
NUM_DICTIONARIES=20. It would be better to use a GUC variable. Also, if the
number of elements reaches the limit, it would be good to use the backend's
local memory instead of shared memory.
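
To make the lookup-or-allocate behaviour concrete, here is a minimal sketch in plain C. It is not the code from the patch: a fixed-size table with a linear search stands in for the shared hash table, an integer stands in for a DSM handle, and the names DictLocation, MAX_KEY_LEN and lookup_or_allocate_location() are hypothetical. Only NUM_DICTIONARIES=20 comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NUM_DICTIONARIES 20     /* table limit mentioned in the mail */
#define MAX_KEY_LEN 128         /* hypothetical key size for this sketch */

/* Hypothetical table entry; segment_id stands in for a DSM handle. */
typedef struct DictLocation
{
    char key[MAX_KEY_LEN];      /* "<dictfile>|<afffile>" */
    int  segment_id;
    int  in_use;
} DictLocation;

static DictLocation locations[NUM_DICTIONARIES];
static int next_segment_id = 1;

/*
 * Return the segment id registered for the file pair, registering a new
 * one if the pair is unknown. *found tells whether the entry already
 * existed. Returns 0 when the table is full or the key does not fit, in
 * which case the caller would fall back to backend-local memory.
 */
int
lookup_or_allocate_location(const char *dictfile, const char *afffile,
                            int *found)
{
    char key[MAX_KEY_LEN];
    int  free_slot = -1;
    int  n;

    assert(dictfile != NULL && afffile != NULL && found != NULL);
    *found = 0;

    n = snprintf(key, sizeof(key), "%s|%s", dictfile, afffile);
    if (n < 0 || (size_t) n >= sizeof(key))
        return 0;               /* key truncated: use local memory */

    for (int i = 0; i < NUM_DICTIONARIES; i++)
    {
        if (locations[i].in_use)
        {
            if (strcmp(locations[i].key, key) == 0)
            {
                *found = 1;
                return locations[i].segment_id;
            }
        }
        else if (free_slot < 0)
            free_slot = i;      /* remember the first free slot */
    }

    if (free_slot < 0)
        return 0;               /* limit reached: use local memory */

    strcpy(locations[free_slot].key, key);
    locations[free_slot].segment_id = next_segment_id++;
    locations[free_slot].in_use = 1;
    return locations[free_slot].segment_id;
}
```

The first call for a given .dict/.aff pair registers a slot and reports found = 0; later calls for the same pair return the same id with found = 1. Once all 20 slots are taken, a new pair gets 0, which is where the fallback to local memory would happen.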

0003-Store-ispell-structures-in-shmem.patch:

Introduces the IspellDictBuild and IspellDictData structures and removes the
IspellDict structure. IspellDictBuild is used within the dispell_build()
function while building the dictionary, if the dictionary has not been
allocated in DSM earlier. IspellDictBuild has a pointer to the
IspellDictData structure, which will be filled with persistent data.

After the dictionary is built, IspellDictData is copied into the
DSM location and the temporary data of IspellDictBuild is released.

All prefix trees are now stored as flat arrays. Those arrays are
allocated and stored using the NodeArray struct. A required node can be
retrieved by its node offset. The AffixData and Affix arrays each have an
additional offset array to retrieve an element by index.
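
As an illustration of why offsets help here, the following is a minimal sketch of a prefix tree kept in one flat array, where nodes refer to each other by array offset rather than by pointer. The layout is hypothetical (FlatNode, FlatNodeArray and the function names are not from the patch, whose NodeArray struct may look quite different). The point is only that such an array can be copied byte for byte to another address and still be searched.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout: links are offsets into the array, 0 = none. */
typedef struct FlatNode
{
    char     ch;
    uint8_t  is_word;
    uint32_t first_child;
    uint32_t next_sibling;
} FlatNode;

/* Hypothetical growable container used only while building. */
typedef struct FlatNodeArray
{
    FlatNode *nodes;
    uint32_t  used;
    uint32_t  allocated;
} FlatNodeArray;

/* Append a zeroed node and return its offset; may move the array. */
static uint32_t
new_node(FlatNodeArray *a, char ch)
{
    if (a->used == a->allocated)
    {
        a->allocated *= 2;
        a->nodes = realloc(a->nodes, a->allocated * sizeof(FlatNode));
        assert(a->nodes != NULL);
    }
    memset(&a->nodes[a->used], 0, sizeof(FlatNode));
    a->nodes[a->used].ch = ch;
    return a->used++;
}

void
node_array_init(FlatNodeArray *a)
{
    a->allocated = 16;
    a->used = 0;
    a->nodes = malloc(a->allocated * sizeof(FlatNode));
    assert(a->nodes != NULL);
    (void) new_node(a, '\0');   /* offset 0 is the root */
}

void
trie_insert(FlatNodeArray *a, const char *word)
{
    uint32_t cur = 0;

    for (; *word; word++)
    {
        uint32_t child = a->nodes[cur].first_child;

        while (child != 0 && a->nodes[child].ch != *word)
            child = a->nodes[child].next_sibling;
        if (child == 0)
        {
            /* Only offsets are kept across new_node(), never pointers. */
            child = new_node(a, *word);
            a->nodes[child].next_sibling = a->nodes[cur].first_child;
            a->nodes[cur].first_child = child;
        }
        cur = child;
    }
    a->nodes[cur].is_word = 1;
}

/* Lookup needs only the raw array, wherever it happens to be mapped. */
int
trie_contains(const FlatNode *nodes, const char *word)
{
    uint32_t cur = 0;

    for (; *word; word++)
    {
        uint32_t child = nodes[cur].first_child;

        while (child != 0 && nodes[child].ch != *word)
            child = nodes[child].next_sibling;
        if (child == 0)
            return 0;
        cur = child;
    }
    return nodes[cur].is_word;
}
```

Because trie_contains() takes only the raw array, the same lookup works on a memcpy() of nodes[0..used), which is the property a copy into a shared segment relies on.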

The Affix field (array of AFFIX) of IspellDictBuild is also persistent data.
But it is constructed as a temporary array first, because the Affix array
needs to be sorted via qsort() within NISortAffixes().
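
The sort-then-store order can be sketched as follows. This is an assumption-laden illustration, not the patch: SketchAffix, cmp_affix() and sort_and_store_affixes() are hypothetical, and the comparison (by type, then by replacement string) is a simplification chosen for the example rather than the ordering NISortAffixes() actually uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified affix record for this sketch. */
typedef struct SketchAffix
{
    int  type;          /* assumed encoding: 0 = prefix, 1 = suffix */
    char repl[16];      /* replacement string */
} SketchAffix;

/* Order by type first, then by replacement string. */
static int
cmp_affix(const void *a, const void *b)
{
    const SketchAffix *x = a;
    const SketchAffix *y = b;

    if (x->type != y->type)
        return (x->type < y->type) ? -1 : 1;
    return strcmp(x->repl, y->repl);
}

/*
 * Sort the temporary array in place, then copy the result into dest,
 * which stands in for the persistent location.
 */
void
sort_and_store_affixes(SketchAffix *temp, size_t n, SketchAffix *dest)
{
    assert(n == 0 || (temp != NULL && dest != NULL));
    if (n == 0)
        return;
    qsort(temp, n, sizeof(SketchAffix), cmp_affix);
    memcpy(dest, temp, n * sizeof(SketchAffix));
}
```

Sorting in the temporary array first means the persistent copy is written once, already in its final order.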

So IspellDictData stores:
- AffixData - array of strings, access via AffixDataOffset
- Affix - array of AFFIX, access via AffixOffset
- DictNodes, PrefixNodes, SuffixNodes - prefix trees as a plain array
- CompoundAffix - array of CMPDAffix, sequential access
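
For the "array of strings, access via offset array" entries in the list above, a minimal sketch of the idea could look like this. StringPool, pool_add() and pool_get() are hypothetical names and the layout is an assumption. Strings are packed back to back in one buffer, and a separate offset array maps an index to the start of its string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical builder for a packed string array plus its offset array. */
typedef struct StringPool
{
    char     *data;         /* NUL-terminated strings, back to back */
    uint32_t *offset;       /* offset[i] = start of string i in data */
    uint32_t  nstrings;
    uint32_t  offset_alloc;
    uint32_t  data_used;
    uint32_t  data_alloc;
} StringPool;

void
pool_init(StringPool *p)
{
    p->data_alloc = 64;
    p->offset_alloc = 8;
    p->data = malloc(p->data_alloc);
    p->offset = malloc(p->offset_alloc * sizeof(uint32_t));
    assert(p->data != NULL && p->offset != NULL);
    p->nstrings = 0;
    p->data_used = 0;
}

/* Append a string and return its index. */
uint32_t
pool_add(StringPool *p, const char *s)
{
    uint32_t len = (uint32_t) strlen(s) + 1;    /* include the NUL */

    while (p->data_used + len > p->data_alloc)
    {
        p->data_alloc *= 2;
        p->data = realloc(p->data, p->data_alloc);
        assert(p->data != NULL);
    }
    if (p->nstrings == p->offset_alloc)
    {
        p->offset_alloc *= 2;
        p->offset = realloc(p->offset, p->offset_alloc * sizeof(uint32_t));
        assert(p->offset != NULL);
    }
    memcpy(p->data + p->data_used, s, len);
    p->offset[p->nstrings] = p->data_used;
    p->data_used += len;
    return p->nstrings++;
}

/* Access by index needs only the two raw arrays, not the builder. */
const char *
pool_get(const char *data, const uint32_t *offset, uint32_t index)
{
    return data + offset[index];
}
```

As with the node arrays, nothing in the packed data is a pointer, so both arrays can be copied as plain bytes and read from wherever they end up.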

I had to remove compact_palloc0(), added by Pavel in
3e5f9412d0a818be77c974e5af710928097b91f3. The Ispell dictionary doesn't need
such allocation anymore. It was used to allocate small chunks of memory. I
will definitely check the performance of the Czech dictionary.

Remaining items to do:
- add the GUC-variable for hash table limit
- fix bugs
- improve comments
- performance testing

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment	Content-Type	Size
0001-Fix-ispell-memory-handling.patch	text/plain	1020 bytes
0002-Retreive-shmem-location-for-ispell.patch	text/plain	7.4 KB
0003-Store-ispell-structures-in-shmem.patch	text/plain	77.3 KB

In response to

[PROPOSAL] Shared Ispell dictionaries at 2017-12-26 16:48:27 from Arthur Zakirov

Responses

Re: [PROPOSAL] Shared Ispell dictionaries at 2018-01-07 19:05:27 from Arthur Zakirov
