Re: Proposal: q-gram GIN and GiST indexes

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 16:38:24
Message-ID: BANLkTin6d3NSe=LGhFi1O4BpNn2cz4EYpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > I would like to propose a q-gram module which would have following
> > differences in comparison with pg_trgm:
> > 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> > queries and LIKE/ILIKE queries
> > 2) Support of various q
>
> Doesn't the index size grow rather drastically with increasing q?

1) GIN
GIN index size can be represent as entries pages size + data pages size.
With increasement of data volume grow of entries pages size is slowing ,
because set of q-grams which are actually occurs in text is limited. It can
be illustrated on following example:

data set count avg. length entries pages data pages
english dictionary 98569 8.5 529 656
names of persons 105420 19.3 475 1985
paper titles 2526871 47.2 1234 84819

Statistics of GIN index of pg_trgm module is presented in table above. We
can see that ratio of entries pages to data pages is decreasing with
increasing of data volume. Since number of q-grams extracted from one
indexed value is similar for distinct q, we should not expect significant
grow of data pages size with increasing q. With increasing q we should
expect increase of entries pages size, but on large datasets and with not
very high q we shoudn't expect significant grow of index size. In some
papers I met assumption that set of whole q-grams in natural text is
relatively small when q <= 5. Accordingly, I think we should expect indexes
to be usable with at least with q = 5.

2) GiST
Since I'm going to store exact values in leaf nodes instead sets of q-grams,
index grow isn't directly expected with increasing of q. Because index size
would depends on size of indexed values and signature size(both don't
depends on q). There would be possible indirect index grow, because we
probably need longer signatures for appropriate representation of q-gram
set.

----
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2011-04-04 16:46:06 Re: GSoC proposal: Fast GiST index build
Previous Message Tom Lane 2011-04-04 16:37:24 Re: [HACKERS] Uppercase SGML entity declarations