wildcard search support for pg_trgm

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: wildcard search support for pg_trgm
Date: 2011-01-24 00:07:56
Message-ID: 4D3CC2DC.6060002@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I tested the patch from
http://archives.postgresql.org/message-id/AANLkTikVxx6_AjZB52OnA7MBzijCPDqSZoMCD-3u1ATq@mail.gmail.com
which adds GIN and GIST index support for wildcard LIKE queries using
pg_trgm.

The patch is a context diff that applies cleanly. Regression test work
after applying it, but they only exercise the similarity() function, so
the new functionality is not covered by them.

The patch seems to work as advised, I tried a few searches and it does
indeed use the gin or gist index to implement '%foo%' searches. I tried
to do some tricky queries and it worked for all of them..

I see two issues with this patch. First of them is the resulting index
size. I created a table with 5 copies of
/usr/share/dict/american-english in it and a gin index on it, using
gin_trgm_ops. The results were:

* relation size: 18MB
* index size: 109 MB

while without the patch the GIN index was 43 MB. I'm not really sure
*why* this happens, as it's not obvious from reading the patch what
exactly is this extra data that gets stored in the index, making it more
than double its size.

That leads me to the second issue. The pg_trgm code is already woefully
uncommented, and after spending quite some time reading it back and
forth I have to admit that I don't really understand what the code does
in the first place, and so I don't understand what does that patch
change. I read all the changes in detail and I could't find any obvious
mistakes like reading over array boundaries or dereferencing
uninitialized pointers, but I can't tell if the patch is correct
semantically. All test cases I threw at it work, though.

I'm not sure if the committer with better knowledge of pg_trgm would be
able to do a better job than me. After a few days digging in that code I
simply give up.

This patch changes the names and signatures of some support functions
for GIN, and I'm not sure how that affects binary compatibility and
pg_upgrade. I tried to create an index with the vanilla source, and then
recompile pg_trgm and reindex the table, but it still was not using the
index. I think it's because it's missing entries in the catalogs about
the index supporting the like strategy. How should this be handled?

I'm going to mark the patch as Waiting on Author, because of the index
size issue (though it might be OK and expected that the index size will
grow so much, I just don't know). As for the comments, or lack of them,
I declary myself incompetent to thoroughly verify that the patch works.
I think it should have at least the added parts commented enough to
match the project's standard.

Sorry for taking so long to review this,

Cheers,
Jan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nate C 2011-01-24 01:01:09 Bug? Unexpected argument handling in pl-python variadic argument function
Previous Message Robert Haas 2011-01-24 00:07:14 Re: Bug in pg_describe_object, patch v2