experimental: TSearch dictionary [de]serialization

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: experimental: TSearch dictionary [de]serialization
Date: 2010-08-31 22:19:20
Message-ID: AANLkTinnim1joUog5bWsFW06uC4vVESZg6XoH40sbTSw@mail.gmail.com
Lists: pgsql-hackers

Hello

I wrote some very primitive code for testing serialization and
deserialization of a TSearch ISpell dictionary. The code works, but for
now it is useful only for speed tests.

The Czech full-text dictionary serializes to a file of approximately
9 MB. Saving takes about 90 ms, and reading takes about the same time.
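
To illustrate why a preprocessed file can be read back quickly, here is a minimal sketch in C. It is not taken from the attached patch, and it does not use the real ISpell structures: FlatNode, FlatHeader, flat_save, flat_load and flat_contains are hypothetical names. The assumption is that the dictionary is kept as one array of nodes that refer to each other by index instead of by pointer, so the array can be written and read with a single call each.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flat trie node: children are addressed by array index,
 * not by pointer, so the array is valid wherever it is loaded. */
typedef struct FlatNode
{
    uint8_t  ch;           /* byte on the edge leading to this node */
    uint8_t  is_word;      /* 1 if a word ends at this node */
    uint16_t nchildren;    /* number of children */
    uint32_t first_child;  /* index of the first child; children are adjacent */
} FlatNode;

typedef struct FlatHeader
{
    uint32_t magic;        /* format marker, checked on load */
    uint32_t nnodes;       /* number of FlatNode entries that follow */
} FlatHeader;

#define FLAT_MAGIC 0x54534431u

/* Write header and node array; returns 0 on success, -1 on failure. */
static int flat_save(const char *path, const FlatNode *nodes, uint32_t nnodes)
{
    FlatHeader hdr = { FLAT_MAGIC, nnodes };
    FILE *f = fopen(path, "wb");
    int ok;

    if (f == NULL)
        return -1;
    ok = fwrite(&hdr, sizeof(hdr), 1, f) == 1 &&
         fwrite(nodes, sizeof(FlatNode), nnodes, f) == nnodes;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the whole array into one allocation; returns NULL on failure. */
static FlatNode *flat_load(const char *path, uint32_t *nnodes)
{
    FlatHeader hdr;
    FlatNode *nodes = NULL;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return NULL;
    if (fread(&hdr, sizeof(hdr), 1, f) == 1 &&
        hdr.magic == FLAT_MAGIC && hdr.nnodes > 0)
    {
        nodes = malloc((size_t) hdr.nnodes * sizeof(FlatNode));
        if (nodes != NULL &&
            fread(nodes, sizeof(FlatNode), hdr.nnodes, f) != hdr.nnodes)
        {
            free(nodes);
            nodes = NULL;
        }
    }
    fclose(f);
    if (nodes != NULL)
        *nnodes = hdr.nnodes;
    return nodes;
}

/* Walk the trie from the root (index 0); returns 1 if word is stored. */
static int flat_contains(const FlatNode *nodes, uint32_t nnodes, const char *word)
{
    uint32_t cur = 0;

    assert(nodes != NULL && nnodes > 0);
    for (; *word != '\0'; word++)
    {
        uint32_t i, next = 0;
        int found = 0;

        for (i = 0; i < nodes[cur].nchildren; i++)
        {
            uint32_t c = nodes[cur].first_child + i;

            if (c >= nnodes)
                return 0;           /* index out of range: corrupt file */
            if (nodes[c].ch == (uint8_t) *word)
            {
                next = c;
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;
        cur = next;
    }
    return nodes[cur].is_word;
}
```

A file written this way depends on the byte order and struct layout of the machine that produced it, so it is only a local cache, not an exchange format. Anything that lives behind pointers in memory, such as a compiled regexp, would first need an index-based representation of its own before it could be stored like this.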

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 92.708 ms -- using a preprocessed dictionary

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 3.758 ms -- standard time (dictionary is loaded)

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 518.528 ms -- typical first evaluation time

So using a preprocessed file helps: by the timings above, the first
evaluation is about 5.6x faster (92.708 ms instead of 518.528 ms). But
it is still about 25x slower than using an already loaded dictionary
(3.758 ms).

I found one issue: I am not able to serialize a full regexp. The Czech
dictionary doesn't use one, so I didn't solve this.

I would like to add a few hooks to ISpellDictionary so that it is
possible to implement custom memory management for ispell dictionaries.
I understand the problems with shared memory and mmap, but I don't see
any other way than to use third-party mmap support. This module does
not have to be in core; this is probably only a local Czech (and maybe
Japanese) problem.
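
For readers unfamiliar with the mmap idea mentioned here, the following is a minimal C sketch of mapping a preprocessed dictionary file read-only with the POSIX mmap call. It is only an illustration under stated assumptions: MappedDict, dict_map and dict_unmap are hypothetical names, nothing here is part of PostgreSQL or of the attached patch, and the file is assumed to hold a position-independent layout (offsets rather than pointers) so that it can be used in place.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a dictionary file mapped into memory. */
typedef struct MappedDict
{
    const void *base;   /* start of the mapping, NULL when not mapped */
    size_t      size;   /* length of the mapping in bytes */
} MappedDict;

/* Map the file read-only; returns 0 on success, -1 on failure. */
static int dict_map(const char *path, MappedDict *md)
{
    struct stat st;
    void *p;
    int fd;

    assert(md != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
    {
        close(fd);
        return -1;
    }
    /* Read-only shared file mapping: pages are faulted in on first
     * access and come from the OS page cache. */
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;
    md->base = p;
    md->size = (size_t) st.st_size;
    return 0;
}

/* Release the mapping and reset the handle. */
static void dict_unmap(MappedDict *md)
{
    assert(md != NULL);
    if (md->base != NULL)
        munmap((void *) md->base, md->size);
    md->base = NULL;
    md->size = 0;
}
```

With such a mapping the file is not copied into process-private memory up front, and processes that map the same file read-only can share the cached pages. Whether this actually removes most of the roughly 90 ms load cost measured above would have to be tested.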

Regards

Pavel Stehule

Attachment    Content-Type    Size
ft02.diff     text/x-patch    19.4 KB
