Skip site navigation (1) Skip section navigation (2)

Re: Proposal: q-gram GIN and GiST indexes

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 16:38:24
Message-ID: BANLkTin6d3NSe=LGhFi1O4BpNn2cz4EYpw@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-hackers
On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > I would like to propose a q-gram module which would have following
> > differences in comparison with pg_trgm:
> > 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> > queries and LIKE/ILIKE queries
> > 2) Support of various q
>
> Doesn't the index size grow rather drastically with increasing q?


1) GIN
GIN index size can be represent as entries pages size + data pages size.
With increasement of data volume grow of entries pages size is slowing ,
because set of q-grams which are actually occurs in text is limited. It can
be illustrated on following example:

data set              count   avg. length   entries pages   data pages
english dictionary    98569           8.5             529          656
names of persons     105420          19.3             475         1985
paper titles        2526871          47.2            1234        84819

Statistics of GIN index of pg_trgm module is presented in table above. We
can see that ratio of entries pages to data pages is decreasing with
increasing of data volume. Since number of q-grams extracted from one
indexed value is similar for distinct q, we should not expect significant
grow of data pages size with increasing q. With increasing q we should
expect increase of entries pages size, but on large datasets and with not
very high q we shoudn't expect significant grow of index size. In some
papers I met assumption that set of whole q-grams in natural text is
relatively small when q <= 5. Accordingly, I think we should expect indexes
to be usable with at least with q = 5.

2) GiST
Since I'm going to store exact values in leaf nodes instead sets of q-grams,
index grow isn't directly expected with increasing of q. Because index size
would depends on size of indexed values and signature size(both don't
depends on q). There would be possible indirect index grow, because we
probably need longer signatures for appropriate representation of q-gram
set.

----
With best regards,
Alexander Korotkov.

In response to

Responses

pgsql-hackers by date

Next:From: Alexander KorotkovDate: 2011-04-04 16:46:06
Subject: Re: GSoC proposal: Fast GiST index build
Previous:From: Tom LaneDate: 2011-04-04 16:37:24
Subject: Re: [HACKERS] Uppercase SGML entity declarations

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group