From: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | experimental: TSearch dictionary [de]serialization |
Date: | 2010-08-31 22:19:20 |
Message-ID: | AANLkTinnim1joUog5bWsFW06uC4vVESZg6XoH40sbTSw@mail.gmail.com |
Lists: | pgsql-hackers |
Hello
I wrote some very primitive code for testing serialization and
deserialization of a TSearch ISpell dictionary. The code works, but
for now it is useful only for speed tests.
The Czech fulltext dictionary serializes to a file of about 9 MB.
Saving takes about 90 ms, and reading takes about the same time.
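To illustrate why a preprocessed file can be read back quickly, here is a minimal sketch of the general technique, not the code in the attached patch. It assumes the dictionary can be represented as a prefix tree whose nodes refer to each other by array index rather than by pointer, so the whole array can be written and read as one block. The names `FlatNode`, `save_nodes`, `load_nodes`, `contains`, and `roundtrip_demo` are hypothetical and do not come from PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flat node: links are array indexes, not pointers, so the
 * array is position independent and can be saved with one fwrite call. */
typedef struct FlatNode
{
    char     ch;            /* character on the edge leading to this node */
    uint8_t  is_word;       /* 1 if a word ends at this node */
    int32_t  first_child;   /* index of the first child, -1 if none */
    int32_t  next_sibling;  /* index of the next sibling, -1 if none */
} FlatNode;

/* Write the node count followed by the raw node array; 0 on success. */
int save_nodes(const char *path, const FlatNode *nodes, uint32_t n)
{
    FILE *f = fopen(path, "wb");
    int   ok;

    if (f == NULL)
        return -1;
    ok = fwrite(&n, sizeof n, 1, f) == 1 &&
         fwrite(nodes, sizeof *nodes, n, f) == n;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the array back with a single allocation; the caller frees it. */
FlatNode *load_nodes(const char *path, uint32_t *n_out)
{
    FILE     *f = fopen(path, "rb");
    FlatNode *nodes = NULL;
    uint32_t  n = 0;

    if (f == NULL)
        return NULL;
    if (fread(&n, sizeof n, 1, f) == 1 && n > 0)
    {
        nodes = malloc((size_t) n * sizeof *nodes);
        if (nodes != NULL && fread(nodes, sizeof *nodes, n, f) != n)
        {
            free(nodes);
            nodes = NULL;
        }
    }
    fclose(f);
    if (nodes != NULL)
        *n_out = n;
    return nodes;
}

/* Look up a word by following first_child/next_sibling links from node 0. */
int contains(const FlatNode *nodes, const char *word)
{
    int32_t cur = 0;            /* node 0 is the root; its ch is unused */

    assert(nodes != NULL && word != NULL);
    for (; *word != '\0'; word++)
    {
        int32_t c = nodes[cur].first_child;

        while (c != -1 && nodes[c].ch != *word)
            c = nodes[c].next_sibling;
        if (c == -1)
            return 0;
        cur = c;
    }
    return nodes[cur].is_word;
}

/* Save and reload a tiny tree holding "ab" and "ad"; returns 1 on success. */
int roundtrip_demo(const char *path)
{
    FlatNode  tree[] = {
        {0,   0,  1, -1},       /* 0: root */
        {'a', 0,  2, -1},       /* 1: "a" (not a word) */
        {'b', 1, -1,  3},       /* 2: "ab" */
        {'d', 1, -1, -1},       /* 3: "ad" */
    };
    uint32_t  n = 0;
    FlatNode *loaded;
    int       ok;

    if (save_nodes(path, tree, 4) != 0)
        return 0;
    loaded = load_nodes(path, &n);
    if (loaded == NULL)
        return 0;
    ok = n == 4 && contains(loaded, "ab") && contains(loaded, "ad") &&
         !contains(loaded, "a") && !contains(loaded, "ac");
    free(loaded);
    return ok;
}
```

Calling `roundtrip_demo` with a writable path saves the four-node tree, loads it again, and checks the lookups against the loaded copy. The file written this way depends on the struct layout and byte order of the machine that produced it, which is acceptable for a local cache but not for a portable format.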
postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)
Time: 92.708 ms -- using a preprocessed dictionary
postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)
Time: 3.758 ms -- standard time (dictionary is loaded)
postgres=# select * from ts_debug('cs','příliš žluťoučký kůň se napil žluté vody');
   alias   │    description    │   token   │  dictionaries   │ dictionary │   lexemes
───────────┼───────────────────┼───────────┼─────────────────┼────────────┼─────────────
 word      │ Word, all letters │ příliš    │ {cspell,simple} │ cspell     │ {příliš}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluťoučký │ {cspell,simple} │ cspell     │ {žluťoučký}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ kůň       │ {cspell,simple} │ cspell     │ {kůň}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ se        │ {cspell,simple} │ cspell     │ {}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ napil     │ {cspell,simple} │ cspell     │ {napít}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 word      │ Word, all letters │ žluté     │ {cspell,simple} │ cspell     │ {žlutý}
 blank     │ Space symbols     │           │ {}              │ [null]     │ [null]
 asciiword │ Word, all ASCII   │ vody      │ {cspell,simple} │ cspell     │ {voda}
(13 rows)
Time: 518.528 ms --- typical first evaluation time
So using a preprocessed file helps: the first evaluation is roughly
5.6x faster (92.7 ms instead of 518.5 ms). But this is still about 25x
slower than using an already loaded dictionary (3.8 ms). I found one
issue: I am not able to serialize a full regexp. The Czech dictionary
doesn't use one, so I didn't solve this.
I would like to add a few hooks to ISpellDictionary so that custom
memory management for ispell dictionaries becomes possible. I
understand the problems with shared memory and mmap, but I don't see
any way other than third-party mmap support. This module need not be
in core; it is probably only a local Czech (and maybe Japanese)
problem.
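As a rough illustration of the mmap idea, the following is a minimal POSIX sketch, not part of the attached patch and not an existing PostgreSQL API. It assumes a serialized dictionary file that is position independent, so it can be mapped read-only and used in place. The names `MappedFile`, `map_readonly`, `unmap_file`, and `mmap_demo` are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only mapping of a serialized file. */
typedef struct MappedFile
{
    const void *data;   /* start of the mapped bytes */
    size_t      size;   /* length of the mapping in bytes */
} MappedFile;

/* Map the whole file read-only; returns 0 on success, -1 on failure. */
int map_readonly(const char *path, MappedFile *out)
{
    struct stat st;
    void       *p;
    int         fd = open(path, O_RDONLY);

    assert(out != NULL);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
    {
        close(fd);
        return -1;
    }
    /* Pages are faulted in lazily and come from the OS page cache. */
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return -1;
    out->data = p;
    out->size = (size_t) st.st_size;
    return 0;
}

/* Release the mapping and clear the handle. */
void unmap_file(MappedFile *m)
{
    munmap((void *) m->data, m->size);
    m->data = NULL;
    m->size = 0;
}

/* Write a small payload, map it back, and compare; returns 1 on success. */
int mmap_demo(const char *path)
{
    static const char payload[] = "serialized dictionary bytes";
    MappedFile  m;
    int         ok;
    int         fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return 0;
    if (write(fd, payload, sizeof payload) != (ssize_t) sizeof payload)
    {
        close(fd);
        return 0;
    }
    close(fd);
    if (map_readonly(path, &m) != 0)
        return 0;
    ok = m.size == sizeof payload &&
         memcmp(m.data, payload, sizeof payload) == 0;
    unmap_file(&m);
    return ok;
}
```

With a mapping like this, several backend processes that map the same file would read the same cached pages instead of each building a private copy, which is the attraction of the approach. It only works if the serialized form contains no absolute pointers, since the mapping address can differ between processes.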
Regards
Pavel Stehule
Attachment | Content-Type | Size |
---|---|---|
ft02.diff | text/x-patch | 19.4 KB |