Re: First implementation of GIN for pg_trgm

From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Subject: Re: First implementation of GIN for pg_trgm
Date: 2007-02-22 14:00:32
Message-ID: 1d4e0c10702220600q715a9329q98ecf0ada562c0a1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

On 2/22/07, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> > From a previous discussion with Teodor, it would be better to store an
> > int in the index instead of a text (it takes less space and is
> > faster). I couldn't find any example so if anyone has an advice to fix
> > that, it's welcome (mostly how to pack the trigram into an int instead
> > of a text).
>
> Something like that: [snip]

I worked on that this morning before I got your email. You'll find
attached a second version of the patch using int4 to store the
trigrams (it's not exactly what you gave me but it should work I
think).

I didn't see any improvement in terms of size of the index (14 MB for
642 738 rows in the index in both cases) or speed.
Our dictionary table contains 78367 words and its size is 3 MB.

Did I miss something?

If there's no benefit to use an int4, I propose we keep the text
version because it will be better if we add UTF-8 support someday.

> our query. So, our consistent function should say FALSE when indexed value is
> not similar to query exactly and TRUE in opposite case. Let lquery is a length
> of query and ncommon is a number of common trigrams (look at cnt_sml()
> function), and consistent function should be:

Yes, that's what I did in my patch if I understand you correctly but
it's not very selective IMHO.

> Now consistent function doesn't guarantee exact result, so we should mark '%'
> operator in CREATE OPERATOR CLASS with RECHECK option.

The attached patch adds a RECHECK too. It seems to work correctly but
the RECHECK COND costs a lot of time :/.

At the moment, I have the following results:
GIST:
-------
test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gist WHERE word % 'aub' ORDER BY sml DESC;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (cost=204.40..204.59 rows=78 width=11) (actual
time=98.028..98.132 rows=50 loops=1)
Sort Key: similarity('aub'::text, word)
-> Bitmap Heap Scan on lieu_mots_gist (cost=4.88..201.95 rows=78
width=11) (actual time=97.575..97.861 rows=50 loops=1)
Recheck Cond: (word % 'aub'::text)
-> Bitmap Index Scan on idx_word_gist (cost=0.00..4.86
rows=78 width=0) (actual time=97.539..97.539 rows=50 loops=1)
Index Cond: (word % 'aub'::text)
Total runtime: 98.303 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gist WHERE word % 'auberge' ORDER BY sml DESC;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=204.40..204.59 rows=78 width=11) (actual
time=135.128..135.196 rows=41 loops=1)
Sort Key: similarity('auberge'::text, word)
-> Bitmap Heap Scan on lieu_mots_gist (cost=4.88..201.95 rows=78
width=11) (actual time=134.829..135.016 rows=41 loops=1)
Recheck Cond: (word % 'auberge'::text)
-> Bitmap Index Scan on idx_word_gist (cost=0.00..4.86
rows=78 width=0) (actual time=134.797..134.797 rows=41 loops=1)
Index Cond: (word % 'auberge'::text)
Total runtime: 135.335 ms
(7 rows)

With GIN:
------------

test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gin WHERE word % 'aub' ORDER BY sml DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Sort (cost=208.45..208.64 rows=78 width=11) (actual
time=60.409..60.513 rows=50 loops=1)
Sort Key: similarity('aub'::text, word)
-> Bitmap Heap Scan on lieu_mots_gin (cost=8.93..205.99 rows=78
width=11) (actual time=10.279..60.228 rows=50 loops=1)
Filter: (word % 'aub'::text)
-> Bitmap Index Scan on idx_word_gin (cost=0.00..8.91
rows=78 width=0) (actual time=10.069..10.069 rows=6676 loops=1)
Index Cond: (word % 'aub'::text)
Total runtime: 60.688 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gin WHERE word % 'auberge' ORDER BY sml DESC;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (cost=208.45..208.64 rows=78 width=11) (actual
time=50.293..50.381 rows=41 loops=1)
Sort Key: similarity('auberge'::text, word)
-> Bitmap Heap Scan on lieu_mots_gin (cost=8.93..205.99 rows=78
width=11) (actual time=21.006..50.122 rows=41 loops=1)
Filter: (word % 'auberge'::text)
-> Bitmap Index Scan on idx_word_gin (cost=0.00..8.91
rows=78 width=0) (actual time=20.839..20.839 rows=928 loops=1)
Index Cond: (word % 'auberge'::text)
Total runtime: 50.534 ms
(7 rows)

--
Guillaume

Attachment Content-Type Size
pg_trgm_gin3.diff text/plain 6.2 KB

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Teodor Sigaev 2007-02-22 15:00:29 Re: First implementation of GIN for pg_trgm
Previous Message Teodor Sigaev 2007-02-22 13:33:57 Re: tsearch in core patch, for inclusion