Re: [PROPOSAL] Shared Ispell dictionaries

From: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Shared Ispell dictionaries
Date: 2018-01-24 17:20:41
Message-ID: 20180124172039.GA11210@zakirov.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they can
> be applyed independently but consistently. I suppose I will prepare new
> version of the patch with fixes and with initial design of new functions
> and commands soon.

I attached new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is used now instead of dynahash.c. A hash table is created during first call of a text search function in an instance. A hash table uses OID of a dictionary instead of file names, so there is no cross-db sharing at all.

Added max_shared_dictionaries_size GUC instead of shared_dictionaries. In current version it can be set only at server start. If a dictionary is allocated in a backend's memory instead of shared memory then LOG message will be raised which includes OID of the dictionary.

Fixed memory leak. During removing a dictionary and invalidating dictionaries cash ts_dict_shmem_release() is called. It unpins mapping of a dictionary, if reference count reaches zero then DSM segment will be unpinned. So allocated shared memory will be released by Postgres.

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (tmplinit too) has second argument, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries including contrib extensions because of second argument for tmplinit.

tmplinit has the following signature now:

dict_init(internal, internal)

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added pg_ts_shared_dictionaries() function and pg_ts_shared_dictionaries system view. They return a list of dictionaries currently in shared memory, with the columns:
- dictoid
- schemaname
- dictname
- size

0006-Shared-memory-ispell-option-v3.patch:

Added SharedMemory option for Ispell dictionary template. It is true by default, because I think it would be good that people will haven't to do anything to allocate dictionaries in shared memory.

Setting SharedMemory=false during ALTER TEXT SEARCH DICTIONARY hasn't immediate effect. It is because ALTER doesn't force to invalidate dictionaries cache, if I'm not mistaken.

> 3) How do I unload a dictionary from the shared memory?
> ...
> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
> ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use functions ts_unload() and ts_reload() instead of new syntax. We already have text search functions like ts_lexize() and ts_debug(), and it is better to keep consistency. I think there are to approach for ts_unload():
- use DSM's pin and unpin methods and the invalidation callback, as it done during fixing memory leak. It has the drawback that it won't have an immediate effect, because DSM will be released only when all backends unpin DSM mapping.
- use DSA and dsa_free() method. As far as I understand dsa_free() frees allocated memory immediatly. But it requires more work to do, because we will need some more locks. For instance, what happens when someone calls ts_lexize() and someone else calls dsa_free() at the same time.

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need compact_palloc0() anymore. Tests show that czech dictionary doesn't consume more memory after the patch.

Tests
-----

I've measured creation time of dictionaries on my 64-bit machine. You can get them from [1]. Here the master is 434e6e1484418c55561914600de9e180fc408378. I've measured french dictionary too because it has even bigger affix file than czech dictionary.

With patch:
czech_hunspell - 247 ms
english_hunspell - 59 ms
french_hunspell - 103 ms

Master:
czech_hunspell - 224 ms
english_hunspell - 52 ms
french_hunspell - 101 ms

Memory:

With patch (shared memory size + backend's memory):
czech_hunspell - 9573049 + 192584 total in 5 blocks; 1896 free (11 chunks); 190688 used
english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free (13 chunks); 13328 used
french_hunspell - 4763456 + 626960 total in 7 blocks; 7680 free (14 chunks); 619280 used

Here french dictionary uses more backend's memory because it has big affix file. Regular expression structures are stored in backend's memory still.

Master (backend's memory):
czech_hunspell - 17181544 total in 2034 blocks; 3584 free (10 chunks); 17177960 used
english_hunspell - 4160120 total in 506 blocks; 2792 free (10 chunks); 4157328 used
french_hunspell - 11439184 total in 1187 blocks; 18832 free (171 chunks); 11420352 used

You can see that dictionaries now takes almost two times less memory.

pgbench with select only script:

SELECT ts_lexize('czech_hunspell', 'slon');
patch: 30431 TPS
master: 30419 TPS

SELECT ts_lexize('english_hunspell', 'elephant'):
patch: 35029 TPS
master: 35276 TPS

SELECT ts_lexize('french_hunspell', 'éléphante');
patch: 22264 TPS
master: 22744 TPS

1 - https://github.com/postgrespro/hunspell_dicts

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
0001-Fix-ispell-memory-handling-v3.patch text/plain 1.5 KB
0002-Retreive-shmem-location-for-ispell-v3.patch text/plain 17.6 KB
0003-Store-ispell-structures-in-shmem-v3.patch text/plain 89.1 KB
0004-Update-tmplinit-arguments-v3.patch text/plain 9.8 KB
0005-pg-ts-shared-dictinaries-view-v3.patch text/plain 12.1 KB
0006-Shared-memory-ispell-option-v3.patch text/plain 6.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message chenhj 2018-01-24 17:21:09 Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
Previous Message Tom Lane 2018-01-24 16:47:31 Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump