Re: Partial match in GIN (next vesrion)

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Partial match in GIN (next vesrion)
Date: 2008-05-16 18:37:27
Message-ID: Pine.LNX.4.64.0805162226440.21547@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Wildspeed was designed as an example application of the GIN's partial
match and as a useful extension for *short* strings. It's also good
standalone demonstration of GIN API. We tried to stay away from full text
search, parser, word delimiters and etc.
>From that point of view it might be
useful contrib, but I agree we have to think better to let it more
usable.

Oleg

On Fri, 16 May 2008, Tom Lane wrote:

> Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> http://www.sigaev.ru/misc/partial_match_gin-0.10.gz
>> http://www.sigaev.ru/misc/tsearch_prefix-0.9.gz
>> http://www.sigaev.ru/misc/wildspeed-0.12.tgz
>
> I've applied the first two of these with minor editorialization (mostly
> fixing documentation). However, I'm having a hard time convincing myself
> that anyone will find wildspeed useful in its current form. I did a
> simple experiment using a table of titles of database papers:
>
> contrib_regression=# select count(*), avg(length(title)) from pub;
> count | avg
> --------+---------------------
> 236984 | 64.7647520507713601
> (1 row)
>
> This takes about 22MB on disk as a Postgres table. I was expecting the
> wildspeed index to be about 65 times as large, which is bad enough
> already, but actually it weighed in at 2165MB or nearly 100X bigger.
> Plus it took forever to build: 35 minutes on a fairly fast machine
> with maintenance_work_mem set to 512MB.
>
> In comparison, building a conventional full-text-search index (GIN
> tsvector) took about 22 seconds including constructing the tsvector
> column, and the tsvectors plus index take about 54MB. The relative
> search performance is about what you'd expect from the difference in
> index sizes, ie, wildspeed loses.
>
> So I'm thinking wildspeed really needs to be redesigned if it's to be
> anything but a toy. I can't see putting it into contrib in this form.
>
> One idea that I had was to break the given string into words (splitting
> at spaces or punctuation) and store the rotations of individual words
> instead of the whole string. (Actually, maybe you only need suffixes
> not rotations, ie for 'abcd' store 'abcd', 'bcd', 'cd', 'd'.) Then
> similarly break the LIKE pattern apart at words to create word-fragment
> search keys. In this scheme the operator would always(?) require
> rechecking since any part of the pattern involving punctuation wouldn't
> be checkable by the index. The advantage is that the index bloat factor
> is governed by the average word length not the average whole-string
> length.
>
> There are probably other approaches that would help, too.
>
> regards, tom lane
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Browse pgsql-patches by date

  From Date Subject
Next Message Merlin Moncure 2008-05-16 19:49:36 Re: libpq object hooks
Previous Message Andrew Chernow 2008-05-16 18:34:20 Re: libpq object hooks