experimental: TSearch dictionary [de]serialization

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	experimental: TSearch dictionary [de]serialization
Date:	2010-08-31 22:19:20
Message-ID:	AANLkTinnim1joUog5bWsFW06uC4vVESZg6XoH40sbTSw@mail.gmail.com
Lists:	pgsql-hackers

Hello

I wrote some very primitive code for testing serialization and
deserialization of a TSearch ISpell dictionary. The code works, but for
now it is useful only for speed tests.

The Czech full-text dictionary serializes to a file of about 9 MB.
Saving takes about 90 ms, and reading takes about the same time.
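To illustrate why a preprocessed file can be written and read back this quickly, here is a minimal sketch of one possible approach. It is not the format used in the attached patch. It assumes a toy trie whose nodes link to each other by array index instead of by pointer, so the whole node array can be dumped and restored with a single write and a single read. The names FlatNode, FlatDict and the flat_* functions are hypothetical, and the real ISpell structures are more involved. The resulting file depends on the platform's byte order and struct padding, so it is only valid as a local cache.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout: links are array indexes, not pointers. */
typedef struct FlatNode {
    uint8_t  ch;            /* byte on the edge leading into this node */
    uint8_t  is_word;       /* 1 if a word ends at this node */
    uint32_t first_child;   /* 0 = none (index 0 is always the root) */
    uint32_t next_sibling;  /* 0 = none */
} FlatNode;

typedef struct FlatDict {
    uint32_t  nnodes, capacity;
    FlatNode *nodes;
} FlatDict;

/* Append a zeroed node and return its index; may move the array. */
static uint32_t flat_new_node(FlatDict *d, uint8_t ch)
{
    if (d->nnodes == d->capacity) {
        d->capacity = d->capacity ? d->capacity * 2 : 16;
        d->nodes = realloc(d->nodes, d->capacity * sizeof(FlatNode));
        assert(d->nodes != NULL);
    }
    memset(&d->nodes[d->nnodes], 0, sizeof(FlatNode));
    d->nodes[d->nnodes].ch = ch;
    return d->nnodes++;
}

/* Return the index of the child of 'parent' labelled 'ch', or 0. */
static uint32_t flat_find_child(const FlatDict *d, uint32_t parent, uint8_t ch)
{
    uint32_t c = d->nodes[parent].first_child;
    while (c != 0 && d->nodes[c].ch != ch)
        c = d->nodes[c].next_sibling;
    return c;
}

void flat_init(FlatDict *d)
{
    memset(d, 0, sizeof(*d));
    flat_new_node(d, 0);                /* index 0 becomes the root */
}

void flat_insert(FlatDict *d, const char *word)
{
    uint32_t cur = 0;
    for (const unsigned char *p = (const unsigned char *) word; *p; p++) {
        uint32_t child = flat_find_child(d, cur, *p);
        if (child == 0) {
            /* Only indexes are used after this call, never old pointers. */
            child = flat_new_node(d, *p);
            d->nodes[child].next_sibling = d->nodes[cur].first_child;
            d->nodes[cur].first_child = child;
        }
        cur = child;
    }
    d->nodes[cur].is_word = 1;
}

int flat_lookup(const FlatDict *d, const char *word)
{
    uint32_t cur = 0;
    for (const unsigned char *p = (const unsigned char *) word; *p; p++) {
        cur = flat_find_child(d, cur, *p);
        if (cur == 0)
            return 0;
    }
    return d->nodes[cur].is_word;
}

/* Save: node count followed by the raw node array. Returns 0 on success. */
int flat_save(const FlatDict *d, const char *path)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    int ok = fwrite(&d->nnodes, sizeof(d->nnodes), 1, f) == 1 &&
             fwrite(d->nodes, sizeof(FlatNode), d->nnodes, f) == d->nnodes;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* Load: one read for the count, one for all nodes; no per-node fix-up. */
int flat_load(FlatDict *d, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    memset(d, 0, sizeof(*d));
    int ok = fread(&d->nnodes, sizeof(d->nnodes), 1, f) == 1 && d->nnodes > 0;
    if (ok) {
        d->capacity = d->nnodes;
        d->nodes = malloc(d->nnodes * sizeof(FlatNode));
        ok = d->nodes != NULL &&
             fread(d->nodes, sizeof(FlatNode), d->nnodes, f) == d->nnodes;
    }
    fclose(f);
    return ok ? 0 : -1;
}
```

Because the links are indexes, the loaded array is usable immediately. A pointer-based structure would instead need every link rebuilt or relocated after reading.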

postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)

Time: 92.708 ms -- first evaluation using a preprocessed dictionary

Time: 3.758 ms -- standard time (dictionary is already loaded)

Time: 518.528 ms -- typical first evaluation time

So using a preprocessed file helps: the first evaluation is about 5x
faster (518 ms down to 93 ms). But this is still about 25x slower than
using an already loaded dictionary (3.8 ms). I found one issue: I am not
able to serialize a full regexp. The Czech dictionary doesn't use one,
so I haven't solved this. I would like to add a few hooks to
ISpellDictionary so that custom memory management can be implemented
for ispell dictionaries. I understand the problems with shared memory
and mmap, but I don't see any other way than to rely on third-party
mmap support. This module doesn't have to be in core; this is probably
only a local Czech (and maybe Japanese) problem.
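As a rough sketch of what such memory-management hooks could look like, the following assumes a small table of function pointers that dictionary-building code calls instead of allocating directly. None of these names (DictMemHooks, Arena, arena_alloc, dict_store_word) exist in PostgreSQL; they are hypothetical and only illustrate the idea. The arena here sits on ordinary heap memory, but the same bump allocator could in principle sit on top of a shared or mmap'ed region supplied by an external module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook table; not part of PostgreSQL's ISpell code. */
typedef struct DictMemHooks {
    void *(*alloc)(void *ctx, size_t size);
    void  *ctx;             /* private state of the allocator */
} DictMemHooks;

/* Default behaviour: plain heap allocation, ctx is unused. */
void *heap_alloc(void *ctx, size_t size)
{
    (void) ctx;
    return malloc(size);
}

/* Alternative: bump allocator over one contiguous region. */
typedef struct Arena {
    char  *base;            /* start of the region */
    size_t used;            /* bytes handed out so far */
    size_t size;            /* total bytes in the region */
} Arena;

void *arena_alloc(void *ctx, size_t size)
{
    Arena *a = ctx;
    size_t aligned = (size + 7) & ~(size_t) 7;  /* round up to 8 bytes */

    assert(a->used <= a->size);
    if (aligned < size || aligned > a->size - a->used)
        return NULL;                            /* overflow or region full */
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Dictionary code allocates only through the hooks. */
char *dict_store_word(const DictMemHooks *h, const char *word)
{
    size_t len = strlen(word) + 1;
    char  *copy = h->alloc(h->ctx, len);

    if (copy != NULL)
        memcpy(copy, word, len);
    return copy;
}
```

With everything allocated from one contiguous region, the region could be written out or mapped as a unit, provided internal references are stored as offsets rather than absolute pointers.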

Regards

Pavel Stehule

Attachment	Content-Type	Size
ft02.diff	text/x-patch	19.4 KB
