Skip site navigation (1) Skip section navigation (2)

Re: First implementation of GIN for pg_trgm

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: First implementation of GIN for pg_trgm
Date: 2007-02-22 12:56:39
Message-ID: 45DD9307.4030402@sigaev.ru (view raw or flat)
Thread:
Lists: pgsql-patches
>  From a previous discussion with Teodor, it would be better to store an
> int in the index instead of a text (it takes less space and is
> faster). I couldn't find any example so if anyone has an advice to fix
> that, it's welcome (mostly how to pack the trigram into an int instead
> of a text).

Something like that:
trg = generate_trgm(VARDATA(text), VARSIZE(text) - VARHDRSZ);
nentries = ARRNELEM(trg);
if ( nentries > 0 )
{
	*entries = palloc(sizeof(Datum)*nentries);
	for(i=0;i<nentries;i++)
	{
		int tmp=0;
		trgm *ptr = GETARR(trg)+i;
	
		CPTRGM(&tmp, ptr);
		tmp >>= 8;
		entries[i] = Int32GetDatum(tmp);
	}
}
Do not forget to change CREATE OPERATOR CLASS accordingly.

> 
> The last problem is that similarity calculated in the GIN index is
> higher than the one with GIST so I have to set the trgm_limit quite
> high to have decent results (a limit of 0.8 instead of 0.3 seems to be
> quite good).
> AFAICS, it comes from the fact that I couldn't find any way to get the
> length of the indexed trigram which is taken into account with GIST so
> we're not exactly filtering the results in the same way.
> Does anyone have an idea on how to fix this point?

For calculating similarity, you should have three value: length of first word 
(let it be a indexed text) in trigrams, length of second word (query word) and
number of the same trigrams on both words. It's a pity, but during index scan we 
don't know length of indexed text. So, in index scan (consistent function) we 
could not compute exact value of similarity, but we can compute lower limit.
For example, if our query has 50 trigrams and only one of them is a common for 
indexed value and query we can conclude that indexed value can not be similar to 
our query. So, our consistent function should say FALSE when indexed value is 
not similar to query exactly and TRUE in opposite case. Let lquery is a length 
of query and ncommon is a number of common trigrams (look at cnt_sml() 
function), and consistent function should be:
#ifdef DIVUNION
/* original formula is: count/(lvalue+lquery-lcommon), so
    with any lvalue > 0 resulting similarity is smaller than
    computed below */
  return ( count/(lquery-lcommon) > limit ) ? TRUE : FALSE;
#else
/* original formula is: count/max(lvalue,lquery) - the same discourse */
return ( count/lquery > limit ) ? TRUE : FALSE;
#endif

Now consistent function doesn't guarantee exact result, so we should mark '%' 
operator in CREATE OPERATOR CLASS with RECHECK option.

-- 
Teodor Sigaev                                   E-mail: teodor(at)sigaev(dot)ru
                                                    WWW: http://www.sigaev.ru/

In response to

Responses

pgsql-patches by date

Next:From: Teodor SigaevDate: 2007-02-22 13:26:23
Subject: Re: tsearch in core patch, for inclusion
Previous:From: Markus SchiltknechtDate: 2007-02-22 12:45:13
Subject: Re: tsearch in core patch, for inclusion

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group