Re: wildcard search support for pg_trgm

From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-24 16:14:11
Message-ID: 4D3DA553.2070909@krogh.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2011-01-24 16:34, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański<wulczer(at)wulczer(dot)org> wrote:
>
>> I see two issues with this patch. First of them is the resulting index
>> size. I created a table with 5 copies of
>> /usr/share/dict/american-english in it and a gin index on it, using
>> gin_trgm_ops. The results were:
>>
>> * relation size: 18MB
>> * index size: 109 MB
>>
>> while without the patch the GIN index was 43 MB. I'm not really sure
>> *why* this happens, as it's not obvious from reading the patch what
>> exactly is this extra data that gets stored in the index, making it more
>> than double its size.
>>
> Do you sure that you did comparison correctly? The sequence of index
> building and data insertion does matter. I tried to build gin index on 5
> copies of /usr/share/dict/american-english with patch and got 43 MB index
> size.
>
>
>> That leads me to the second issue. The pg_trgm code is already woefully
>> uncommented, and after spending quite some time reading it back and
>> forth I have to admit that I don't really understand what the code does
>> in the first place, and so I don't understand what does that patch
>> change. I read all the changes in detail and I could't find any obvious
>> mistakes like reading over array boundaries or dereferencing
>> uninitialized pointers, but I can't tell if the patch is correct
>> semantically. All test cases I threw at it work, though.
>>
> I'll try to write sufficient comment and send new revision of patch.
>
Would it be hard to make it support "n-grams" (e.g. making the length
configurable) instead of trigrams? I actually had the feeling that
penta-grams (pen-tuples or whatever they would be called) would
be better for my usecase (large substring-search in large documents ..
eg. 500 within 3.000.

Larger sizes.. lesser "sensitivity" => Faster lookup .. perhaps my logic
is wrong?

Hm.. or will the knngist stuff help me here by selecting the best using
pentuples from the beginning?

The above comment is actually general to pg_trgm and not to the wildcard
search
patch above.

Jesper
--
Jesper

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2011-01-24 16:52:31 Re: Allowing multiple concurrent base backups
Previous Message Alexander Korotkov 2011-01-24 15:34:54 Re: wildcard search support for pg_trgm