WIP: shared ispell dictionary

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: WIP: shared ispell dictionary
Date: 2010-03-18 10:33:46
Message-ID: 162867791003180333s1933e5b7g9208dd9a2bb681c6@mail.gmail.com
Lists: pgsql-hackers

Hello

The attached patch adds the ability to share an ispell dictionary between
processes. The motivation is the slowness of the first tsearch query
and the amount of memory allocated per process. When I tested loading
an ispell dictionary (for the Czech language), it took about 500 ms and
48 MB. With a simple allocator it uses only 25 MB. If we remove some
checks and the tolower string transformation from the loading stage,
loading needs only 200 ms, but with a broken dict or affix file it can
then produce wrong results. This patch significantly reduces the load on
servers that use ispell dictionaries.
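
To illustrate why a simple allocator can roughly halve the memory figure,
here is a minimal sketch of a bump ("arena") allocator in C. It is not the
code from the patch: the names Arena, arena_init, arena_alloc, arena_strdup
and arena_free_all are hypothetical, and it assumes that "simple allocator"
means carving many small chunks out of one large block with no per-chunk
header and no individual free.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bump allocator: one big block, no per-chunk bookkeeping. */
typedef struct Arena
{
    char   *base;   /* start of the block */
    size_t  size;   /* total bytes in the block */
    size_t  used;   /* bytes handed out so far */
} Arena;

/* Reserve the whole block up front; returns 0 on success, -1 on failure. */
static int
arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Hand out the next n bytes, rounded up to 8-byte alignment. */
static void *
arena_alloc(Arena *a, size_t n)
{
    size_t  aligned = (n + 7) & ~(size_t) 7;
    void   *p;

    if (aligned < n || aligned > a->size - a->used)
        return NULL;            /* size overflow or block exhausted */
    p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Copy a NUL-terminated string (e.g. a dictionary word) into the arena. */
static char *
arena_strdup(Arena *a, const char *s)
{
    size_t  len = strlen(s) + 1;
    char   *p = arena_alloc(a, len);

    if (p)
        memcpy(p, s, len);
    return p;
}

/* Chunks are never freed one by one; the whole block goes away at once. */
static void
arena_free_all(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

Because data loaded this way is written once and then only read, the same
layout could in principle live in a shared memory segment instead of a
malloc'd block; that part is not shown here.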

I know that Tom worries about the use of shared memory. I think that
worry is unnecessary here: after loading, the dictionary data are only
read, never modified. A second idea: this dictionary template could be
distributed as a separate project (it needs a few changes in core, and
the simple allocator).

Usage:

a) set shared_data = 26MB (in postgresql.conf)
b) restart the server
c) register the dictionary with the option "share = yes"

CREATE TEXT SEARCH DICTIONARY cspell
(template=ispell, dictfile = czech, afffile=czech, stopwords=czech,
share = yes);

[pavel(at)nemesis src]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.

postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
   alias   |    description    |   token   |  dictionaries   | dictionary |   lexemes
-----------+-------------------+-----------+-----------------+------------+-------------
 word      | Word, all letters | Příliš    | {cspell,simple} | cspell     | {příliš}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluťoučký | {cspell,simple} | cspell     | {žluťoučký}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | kůň       | {cspell,simple} | cspell     | {kůň}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | se        | {cspell,simple} | cspell     | {}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | napil     | {cspell,simple} | cspell     | {napít}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluté     | {cspell,simple} | cspell     | {žlutý}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | vody      | {cspell,simple} | cspell     | {voda}
(13 rows)

Time: 8,178 ms   <<-- about 8 ms (decimal comma); without the patch about 500 ms

Limits and ToDo:
a) it supports only simple regular expressions
b) it doesn't handle cache reset and shared memory deallocation

Regards
Pavel Stehule

Attachment Content-Type Size
shared_dictionary_02.diff application/octet-stream 40.9 KB
