Re: [PROPOSAL] Shared Ispell dictionaries

From: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Shared Ispell dictionaries
Date: 2019-04-05 16:21:31
Message-ID: b879eb32-5d07-02cb-7a20-c5431fbd0b6c@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

On 25.02.2019 14:33, Arthur Zakirov wrote:
>> The current approach inherently involves double-buffering: you've got
>> the filesystem cache containing the data read from disk, and then the
>> DSM containing the converted form of the data.  Having something that
>> you could just mmap() would avoid that, plus it would become a lot
>> less critical to keep the mappings around.  You could probably just
>> have individual queries mmap() it for as long as they need it and then
>> tear out the mapping when they finish executing; keeping the mappings
>> across queries likely wouldn't be too important in this case.
>>
>> The downside is that you'd probably need to teach resowner.c about
>> mappings created via mmap() so that you don't leak mappings on an
>> abort, but that's probably not a crazy difficult problem.
>
> It seems to me Tom and Andres also vote for the mmap() approach. I think
> I need to look closely at the mmap().
>
> I've labeled the patch as 'v13'.

I've attached new version of the patch. Note that it is in WIP state for
now and there are unresolved issues, which is listed at the end of the
email.

The patch implements simple approach of using mmap(). Also I want to be
sure that I'm going in right direction. Feel free to send a feedback.

On every dispell_init() call Postgres checks is there a shared
dictionary file in the pg_shdict directory, if it is then calls mmap().
If there is no such file then it compiles the dictionary, write it to
the file and calls mmap().

dispell_lexize() works with already mmap'ed dictionary. So it doesn't
mmap() for each individual query as Robert proposed above. It's because
such approach reduces performance twice (I tested with ts_lexize() calls
by pgbench).

Tests
-----

Like in:
https://www.postgresql.org/message-id/20180124172039.GA11210%40zakirov.localdomain

i performed tests. There are now big differences in numbers except that
files are being created now in the pg_shdict directory:

czech_hunspell - 9.2 MB file
english_hunspell - 1.9 MB file
french_hunspell - 4.6 MB file

TODO
----

- Improve the documentation and comments.
- Eliminate shared dictionary files after DROP/ALTER calls. It necessary
to come up with some fancy file name. For now it is just OID of a
dictionary. So it is possible to add database OID, xmin or xmax into a
file name.
- We cant remove the file right away after DROP/ALTER. Is it good idea
to use autovacuum here?

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
0001-Fix-ispell-memory-handling-v19.patch text/x-patch 1.6 KB
0002-Change-tmplinit-argument-v19.patch text/x-patch 12.5 KB
0003-Retrieve-shared-location-for-dict-v19.patch text/x-patch 180.4 KB
0004-Store-ispell-in-shared-location-v19.patch text/x-patch 90.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shawn Debnath 2019-04-05 16:22:06 Re: Refactoring the checkpointer's fsync request queue
Previous Message Thibaut 2019-04-05 16:12:12 Re: selecting from partitions and constraint exclusion