Re: [PROPOSAL] Shared Ispell dictionaries

From: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Shared Ispell dictionaries
Date: 2018-01-13 15:22:41
Message-ID: CAKNkYnzQ8SJCiFityn+4B_0zZKOmfw1R1M0KHXaFF8BYf9DJ=g@mail.gmail.com
Lists: pgsql-hackers

Hello,

Thank you, Tomas, for your review.

On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

I agree about tmpstrdup(). It will be consistent with tmpalloc().

> 1) Why do we actually need the limit? Is it really necessary / useful?
> ...
> I realize the current implementation requires that, because the hash
> table is still created in an old-style memory context (and only the
> dictionaries are in DSM segments).

Yes, indeed. I tried to put the dynahash table into DSM, but failed. It
seems that dynahash can live only in an old-style memory context.

> I'm not sure if dynahash can live in a DSM segment, but we already have
> a hash table that supports that in dshash.c (which is also concurrent,
> although I'm not sure if that's a major advantage for this use case).

Thank you for pointing me to dshash.c. I think it is just what we
need. I will try to use it in the new version of the patch.

> 2) Do we actually want/need some limits? Which ones?
> ...
> And finally, I believe this is log-worthy - right now the dictionary
> load silently switches to backend memory (thus incurring all the parsing
> overhead). This certainly deserves at least a log message.

I think such a log message would be useful, so I will add it too.

> So I do suggest adding such "max memory for shared dictionaries" limit.
> I'm not sure we can enforce it strictly, because when deciding where to
> load the dict we haven't parsed it yet and so don't know how much memory
> will be required. But I believe a lazy check should be fine (load it,
> and if we exceeded the total memory disable loading additional ones).

With dshash in DSM it seems that the shared_dictionaries GUC variable is
not needed anymore. I agree that another GUC variable (for example,
max_shared_dictionaries_size) may be useful. But maybe it's worth
checking the size of a dictionary only after actually compiling it? We
could do the following:
- within ispell_shmem_location(), build the dictionary using the
  callback function
- the callback function returns its size; if the dictionary doesn't fit
  into the remaining shared space, ispell_shmem_location() just returns a
  pointer to the palloc'ed, compiled dictionary without creating a
  DSM segment.

> 3) How do I unload a dictionary from the shared memory?
> ...
> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
> ALTER TEXT SEARCH DICTIONARY x RELOAD

I think this syntax will be very useful not only for Ispell but for
other dictionaries too. So the init_function of a text search template
could return pointers to C functions which unload and reload
dictionaries. This approach doesn't require changing the catalog by
adding additional functions to the template [1].

If the init_function of a template doesn't return such pointers, then
the template doesn't support unloading or reloading, and the UNLOAD and
RELOAD commands should throw an error if a user calls them for such a
template.

> 5) Actually, how do I list currently loaded dictionaries (and how much
> memory they use in the shared memory)?

This may be very useful too. Such a function could be called
pg_get_shared_dictionaries().

> 6) What other restrictions would be useful?
> ...
> CREATE TEXT SEARCH DICTIONARY x (
> TEMPLATE = ispell,
> DictFile = czech,
> AffFile = czech,
> StopWords = czech,
> SharedMemory = true/false (default: false)
> );

Hm, I hadn't thought about such an option. It would be a very simple way
for a user to control shared dictionaries.

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

As I understood from commit 3e5f9412d0a818be77c974e5af710928097b91f3,
compact_palloc0() reduces the overhead of many palloc's for small
chunks of data. The persistent data of the patch should not suffer from
this overhead, because persistent data is allocated in big chunks.

But now I realize that we can keep compact_palloc0() for small chunks of
temporary data, so it may be worth keeping compact_palloc0() after all.

> 8) One more thing - I've noticed that the hash table uses this key:
> ...
> That is, full paths to the two files, and I'm not sure that's a very
> good idea. Firstly, it's a bit wasteful (1kB per path). But more
> importantly it means all dictionaries referencing the same files will
> share the same chunk of shared memory - not only within a single
> database, but across the whole cluster. That may lead to surprising
> behavior, because e.g. unloading a dictionary in one database will
> affect dictionaries in all other databases referencing the same files.

Hm, indeed. It would be better to use only file names instead of full
paths. And it would be a good idea to include more information besides
the file names: perhaps the OID of the database and the OID of the
namespace, because a dictionary can be created in different schemas.

I think your proposals can be implemented as several patches, so they can
be applied independently but consistently. I will prepare a new
version of the patch with the fixes and with an initial design of the new
functions and commands soon.

1 - https://www.postgresql.org/docs/current/static/sql-createtstemplate.html

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
