Skip site navigation (1) Skip section navigation (2)

GIN - Generalized Inverted iNdex. Try 2.

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: GIN - Generalized Inverted iNdex. Try 2.
Date: 2006-04-26 14:23:20
Message-ID: 444F8258.2040102@sigaev.ru (view raw or flat)
Thread:
Lists: pgsql-hackers
We (me and Oleg) are glad to present GIN to PostgreSQL. If community will agree, 
we will commit it to HEAD branch.

http://www.sigaev.ru/gin/gin.gz
http://www.sigaev.ru/gin/README.txt

Install:
% cd  pgsql
% zcat gin.gz | patch -p0
make and initdb, install tsearch2


Changes from previous patch:
* add support for tsearch2
* add 'fuzzy' limit
* fixes

README:

Gin for PostgreSQL

Gin was sponsored by jfg://networks (http://www.jfg-networks.com/)

Gin stands for Generalized Inverted Index and should be considered as a genie,
not a drink.

Generalized means that index doesn't know what operation it accelerates, it
works with strategies, defined for specific data type (read Index Method
Strategies chapter in PostgreSQL documentation). In that sense, gin is similar
to the GiST and differs from btree index, which has predefined comparison based
operations.

Inverted index is an index structure storing a (key,posting list) pairs, where
posting list is a set of documents in which key occurs (document, usually,
contains many keys). The primary goal of the Gin index is a support for
scalable full text search in PostgreSQL.

Gin is consists of a B-tree constructed over entries (ET, entries tree), where
entry is an element of indexed value ( element of array, lexeme for tsvector)
and where each tuple in the leaf pages is either pointer to B-tree over item
pointers (PT, posting tree), or list of item pointers (PL, posting list) if
tuple is small enough.

Notes: There is no delete operation for ET. The reason for this, is that from
our experience, a set of unique words of a whole collection changed very rare.
This greatly simplify code and concurrency algorithm.

Gin comes with built-in support for one dimensional arrays ( integer[], text[],
no support for NULL elements) and following operations:

   * contains : value_array @ query_array
   * overlap : value_array && query_array
   * contained: value_array ~ query_array

Synopsis

=# create index txt_idx on aa using gin(a);

Features

   * Concurrency
   * WAL-logging (recoverable)
   * user-defined opclass, the scheme is similar to GiST
   * optimized index creation (make use of maintenance_work_mem to accumulate
     postings in memory)
   * tsearch2 opclass
   * soft upper limit of the returned results set using GUC variable
     gin_fuzzy_search_limit

Gin fuzzy limit

There are often situations, when full text search returns a very big set of
results, which is very difficult to manage, since reading tuples from disk and
their ordering could takes a lot of time, which is unacceptable for the
production (notice, that search itself is very fast). Such queries are usually
contain very frequent lexemes, so results are not very helpful. To facilitate
execution of such queries we introduced a configurable soft upper limit of the
size of the returned set - GUC variable gin_fuzzy_search_limit, which is 0 on
default (no limitation). This subset randomly chosen from the whole result set.
Soft means that the actual number of returned results could slightly differs
from this limit, depending on the query and the quality of system random
generator. From our experience, we found that value about several thousands
(5000-20000) is ok, i.e., gin fuzzy limit will have no effects for queries
returning result set lesser than this number.

Limitations

   * no support for multicolumn indices
   * Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
   * Gin searches entry only by equality matching, this may be improved in
     future
   * Gin doesn't supports full scan of index
   * Gin doesn't index NULL value

Gin interface

OpClass interface (pseudocode). Example for opclas is in ginarayproc.c.

Datum* extractValue(Datum inputValue, uint32* nentries)
     Returns array of Datum of entries of value to be indexed, nentries should
     contains number of returning entries
int compareEntry( Datum a, Datum b )
     Compares two entries (not the indexing values!)
Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n)
     Returns array of Datum of entries of query to be searched, n contains
     Strategy number of operation.
bool consistent( bool[] check, StrategyNumber n, Datum query)
     The size of check array is the same as sizeof of array returned by
     extractQuery. Each element of check array is true if indexed value has
     corresponding entry in the query, i.e. if check[i] == TRUE then i-th entry
     of query is presented in indexed value. Function should returns true if
     indexed value matches by StrategyNumber and query.

Open items

We appreciate any comments, help and recommendations.

   * teach optimizer/executor, that GIN is intrinsically clustered, i.e., it
     always returns ItemPointer in ascending order.
   * tweak gincostestimate
   * GIN stores several ItemPointer to heap tuple, so vacuum full produces
     warning message:

      WARNING:  index "idx" contains 88395 row versions, but table contains 51812
  row versions
      HINT:  Rebuild the index with REINDEX.

TODO

Nearest future:

   * opclasses for all types (no programming, just many changes of the catalog).

Distant future:

   * Replace entry btree to something like the GiST
   * Add multicolumn support
   * Optimize insert operation (background index insertion)

Authors: All work was done by Teodor Sigaev (teodor(at)sigaev(dot)ru) and Oleg
Bartunov (oleg(at)sai(dot)msu(dot)su).

-- 
Teodor Sigaev                                   E-mail: teodor(at)sigaev(dot)ru
                                                    WWW: http://www.sigaev.ru/

Responses

pgsql-hackers by date

Next:From: Simon RiggsDate: 2006-04-26 14:48:04
Subject: Re: Avoiding redundant fetches of btree index metapages
Previous:From: Bruce MomjianDate: 2006-04-26 02:18:41
Subject: Re: [HACKERS] For Review: Allow WAL information to recover corrupted

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group