Re: [PROPOSAL] Shared Ispell dictionaries

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Shared Ispell dictionaries
Date: 2018-01-13 02:25:55
Message-ID: d12d9395-922c-64c9-c87d-dd0e1d31440e@2ndquadrant.com
Lists: pgsql-hackers

Hi Arthur,

I've done some initial review of the patch today, and here are some
thoughts:

0001-Fix-ispell-memory-handling-v2.patch

This makes sense. The patch simply replaces two cpstrdup() calls with
MemoryContextStrdup(), but I see spell.c already has two macros (tmpalloc
and tmpalloc0) to allocate memory in the buildCxt. What about adding a
tmpstrdup macro to copy a string into that context too? I admit this is
mostly nitpicking, though.
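For example, following the existing tmpalloc() pattern, I'd imagine
something like this (just a sketch, assuming the macro can reference
Conf->buildCxt the same way the existing ones do):

    #define tmpstrdup(str)  MemoryContextStrdup(Conf->buildCxt, (str))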

0002-Retreive-shmem-location-for-ispell-v2.patch

I think the GUC name should make it clear that it's a maximum number of
something, just like "max_parallel_workers" and other such GUCs. When I
first saw "shared_dictionaries" in the patch, I thought it was a list of
dictionary names, or something like that.

I have a bunch of additional design questions and proposals (not
necessarily required for v1, but perhaps useful for shaping it).

1) Why do we actually need the limit? Is it really necessary / useful?

When I wrote shared_ispell back in 2012, all we had were fixed-size
segments allocated at server start, so similar limits were a built-in
restriction. But after the DSM infrastructure was introduced, I imagined
such limits would no longer be necessary.

I realize the current implementation requires a limit, because the hash
table is still created in the regular (fixed-size) shared memory segment,
and only the dictionaries themselves live in DSM segments.

But that seems fairly straightforward to fix by keeping the hash table in
a separate DSM segment too. A lookup of a dictionary DSM would then first
have to check which segment currently holds the hash table, and then
continue as it does now.

I'm not sure if dynahash can live in a DSM segment, but we already have
a hash table that supports that in dshash.c (which is also concurrent,
although I'm not sure if that's a major advantage for this use case).
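To illustrate, a rough sketch of how that might be wired up using
dshash.c (the entry struct, the tranche ID, and the use of the generic
memcmp/memhash helpers are all just placeholders here):

    /* sketch: a dictionary hash table living in DSA memory */
    static const dshash_parameters dict_table_params = {
        sizeof(TsearchDictKey),     /* key size */
        sizeof(TsearchDictEntry),   /* entry size (hypothetical struct) */
        dshash_memcmp,              /* compare function */
        dshash_memhash,             /* hash function */
        LWTRANCHE_TSEARCH_DICT      /* tranche ID (hypothetical) */
    };

    /* created once; other backends then dshash_attach() via its handle */
    dshash_table *dicts = dshash_create(area, &dict_table_params, NULL);

No fixed-size allocation at startup, hence no need for a hard limit on
the number of entries.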

2) Do we actually want/need some limits? Which ones?

That is not to say we don't need/want some limits, but the current limit
may not be the droid we're looking for, for a couple of reasons.

Firstly, the limit currently only matters during startup, when the
dynahash is created. So to change the limit (e.g. to increase it) you
actually have to restart the database, which is obviously a major hassle.

Secondly, dynahash tweaks the supplied values to get proper behavior -
for example, it does not use the requested size directly but rounds it up
to some higher power-of-two value. Which means the limit may not be
enforced right when you hit the GUC value, but unexpectedly somewhat
later.

And finally, I believe this is log-worthy - right now the dictionary
load silently switches to backend memory (thus incurring all the parsing
overhead). This certainly deserves at least a log message.
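Something as simple as this sketch would do ("dictfile" standing in for
whatever name is available at that point in the code):

    /* make the fallback to backend-local memory visible in the log */
    ereport(LOG,
            (errmsg("shared dictionary limit reached, loading \"%s\" into backend memory",
                    dictfile)));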

Actually, I'm not sure "number of dictionaries" is a particularly useful
limit in the first place - that's not a number I really care about. But I
do care about the amount of memory consumed by the loaded dictionaries.

So I suggest adding such a "max memory for shared dictionaries" limit.
I'm not sure we can enforce it strictly, because when deciding where to
load a dictionary we haven't parsed it yet, and so don't know how much
memory it will require. But I believe a lazy check should be fine (load
it, and if the total memory is now exceeded, disable loading of
additional ones).
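In guc.c that might look something like this (a hypothetical entry - the
name, context and default are all up for discussion):

    {
        {"max_shared_dictionaries_size", PGC_SIGHUP, RESOURCES_MEM,
            gettext_noop("Sets the maximum amount of memory used by shared dictionaries."),
            NULL,
            GUC_UNIT_KB
        },
        &max_shared_dictionaries_size,
        102400, 0, MAX_KILOBYTES,
        NULL, NULL, NULL
    },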

3) How do I unload a dictionary from the shared memory?

Assume we've reached the limit (it does not matter if it's the number of
dictionaries or memory used by them). How do I resolve that without
restarting the database? How do I unload a dictionary (which may be
unused) from shared memory?

ALTER TEXT SEARCH DICTIONARY x UNLOAD

4) How do I reload a dictionary?

Assume I've updated the dictionary files (added new words into the
files, or something like that). How do I reload the dictionary? Do I
have to restart the server, DROP/CREATE everything again, or what?

What about instead having something like this:

ALTER TEXT SEARCH DICTIONARY x RELOAD

5) Actually, how do I list the currently loaded dictionaries (and see how
much shared memory they use)?
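A simple SRF would probably be enough, along these lines (the name and
the columns are entirely hypothetical, of course):

    SELECT * FROM pg_shared_dictionaries();
    -- dictfile | afffile | size_bytes | ...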

6) What other restrictions would be useful?

I think it should be possible to specify which ispell dictionaries may be
loaded into shared memory, and which should always be loaded into local
backend memory. That is, something like this:

CREATE TEXT SEARCH DICTIONARY x (
    TEMPLATE = ispell,
    DictFile = czech,
    AffFile = czech,
    StopWords = czech,
    SharedMemory = true/false (default: false)
);

because otherwise the dictionaries will compete for shared memory, and
it's unclear which of them will get loaded. For a server with a single
application that may not be a huge issue, but think about servers shared
by multiple applications, etc.

In the shared_ispell extension this was achieved kinda explicitly, by
defining a separate 'shared_ispell' template. But as this patch modifies
the current ispell template, that won't work here, of course.

7) You mentioned you had to get rid of compact_palloc0() - can you
elaborate a bit on why that was necessary? Also, when benchmarking the
impact of this change, make sure to measure not only the time but also
the memory consumption.

In fact, that was the main reason why Pavel implemented it back in 2010 -
the Czech dictionary takes quite a bit of memory, and without shared
memory a copy was kept in every backend.

Of course, maybe that would be mostly irrelevant thanks to this patch
(due to changes to the representation and keeping just a single copy).

8) One more thing - I've noticed that the hash table uses this key:

typedef struct
{
    char    dictfile[MAXPGPATH];
    char    afffile[MAXPGPATH];
} TsearchDictKey;

That is, full paths to the two files, and I'm not sure that's a very
good idea. Firstly, it's a bit wasteful (1kB per path). But more
importantly it means all dictionaries referencing the same files will
share the same chunk of shared memory - not only within a single
database, but across the whole cluster. That may lead to surprising
behavior, because e.g. unloading a dictionary in one database will
affect dictionaries in all other databases referencing the same files.
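One possible direction (just a sketch) would be to scope the key to a
database, e.g. by using OIDs instead of paths:

    /* hypothetical alternative key, scoped to a single database */
    typedef struct
    {
        Oid     dbid;       /* database the dictionary belongs to */
        Oid     dictid;     /* OID of the text search dictionary */
    } TsearchDictKey;

That also shrinks the key from 2kB to 8 bytes, at the cost of keeping
separate copies when two databases reference the same files. Whether
that's a good trade-off is debatable, but at least the behavior would be
less surprising.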

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
