Re: Proposal: q-gram GIN and GiST indexes

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 11:35:13
Message-ID: BANLkTi=A6S0bsdPqmZDzuQ0SH4fhEUfGKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here is text of my GSoC proposal. Given details probably makes essence of my
proposal clear. Any comments are welcome.

*Name of project*

Q-gram indexing module
*
*
*Synopsis*

Currently PostgreSQL has support for trigram-based string collection
indexing in pg_trgm module. Indexes in pg_trgm was originally focused on
trigram similatiry queries optimization. LIKE/ILIKE operations support was
added by my patch, but this support is limited, because trigrams not always
can be extracted from wildcard.
I would like to propose a q-gram module which would have following
differences in comparison with pg_trgm:
1) Focus on acceleration of edit distance (e.g. levenshtein distance)
queries and LIKE/ILIKE queries
2) Support of various q
3) Support of positional q-grams (q-gram which contain position in which it
appears in text) in GIN which give a hope to handle edit distance queries
better than ordinal q-grams
4) LIKE/ILIKE acceleration even in case when no q-gram can be extracted from
wildcard (via partial match)
5) Support of various signature size in GiST
6) Store exact string value in leaf pages of GiST index tree (In spite of
set of q-grams, exact string value takes less space and allow exact
filtering for LIKE/ILIKE and edit distance filtering)

*Benefits to the PostgreSQL Community*

Additional support of q-gram indexing will consolidate PostgreSQL as most
advanced DBMS in fuzzy search on string collections. Q-gram indexes will be
applicable for industry tasks.

*Quantifiable results*

Acceleration of edit distance queries and LIKE/ILIKE queries.

*Project Details*

*Q-gram index for LIKE/ILIKE queries*

Basic idea is so [1]:
1) Extract q-grams which should anyway present in string which conforms to
wildcard.
2) Filter only string which contain q-grams extracted on previous step. As
the rule, recheck is required.
Currently this approach was impelemented for pg_trgm, but there is following
limitation besides fixed q. In some cases we can extract no q-grams from
wildcard (for example '%zz%' and q = 3). In GIN we could filter all q-grams
which starts from 'zz' using partial match. But it require to store original
q-gram in index (in pg_trgm crc32 in stored in the case of multibyte
encoding).

*Using q-gram index for edit distance queries*

Number of common trigrams in strings p and r is at least max{length(p),
length(r)} - q + 1 - k*q [2,5]. If p is query string and r is indexed
string, we don't know exactly length of r. Then we can use following minimal
common q-grams number bound length(p) - q + 1 - k*q. This bound allow us to
filter strings using q-gram GIN and GiST indexes. Moreover, when filtering
using GIN or GiST index we know about absence in indexing string of
particular q-grams. In some cases it should be more effective then minimal
bound of common q-grams. For examle, if query string is 'abcdefg' then
absence of trigrams 'abc' and 'efg' can ensure us that minimal possible edit
distance is 2.

*Using positional q-gram index for edit distance queries*

In string to set of q-grams decomposition not only q-gram presence itself
but also q-gram position is useful information [3,4,5]. GIN index can store
q-gram position as lower part of key. Q-gram position can be-used for edit
distance filtering. For example, if we extract q-gram x with position l from
query with threshold k, then corresponding q-gram in indexed string should
have position in interval [l-k,l+k]. This condition can be used for more
effective filtering than with ordinal q-grams.

*Questions to discuss with community*

1) Find a way to give additional parameters to index (value of q and
signature size for GiST). If it wouldn't be possible to add
extension-specific parameters to GiST and GIN indexes then distinct
opclasses can be created for various parameters values;
2) Find a way to represent edit distance query. Edit distance query require
two parameters query string and threshold. Threshold is maximum edit
distance between result string and query string. Similar problem presents in
pg_trgm for trigram similarity query: we need threshold of matchind trigrams
amount. In pg_trgm threshold is declared as global session parameter and it
can be manipulated by set_limit show_limit. However this approach have some
limitation. For example, different thresholds can't be combined in same
query. Probably, we should consider using of composite type for edit
distance query respresentation.
3) Find appropriate place for proposed functionality: core, contrib module
or a separate project.

*Inch-stones (project broken into small, distinct chunks)*

1) Produce a solution af architecture questions with help of community.
2) Edit distance operator implementation
3) Implement GIN qgram index
b) Index building implementation
c) Edit distance strategy implementation
d) LIKE/ILIKE strategy implementation
4) Implement GIN pqgram index
a) Index building implementation
b) Edit distance strategy implementation
c) LIKE/ILIKE strategy implementation
5) Implement GiST qgram index
a) Index building implementation
b) Edit distance strategy implementation
c) Edit distance KNN-strategy implementation
d) LIKE/ILIKE strategy implementation
6) Perfomance testing on real life datasets
7) Manual writing
8) Code cleanup and commiting

*Project Schedule*

until May 22
Solve architecture questins with help of commutiny.

May 23 - May 29
Implement edit distance filter operator and KNN-operator.

May 30 - June 12
Implement GIN q-gram index.

June 13 - June 26
Implement GIN pq-gram index.

June 27 - July 10
Implement GiST q-gram index.

July 11 - July 24
Perform testing on real life data.

July 25 - July 31
Manural writing.

August 1 - August 16
Final code cleanup and commiting.

*Links*

1)
http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-wildcard-queries-1.html
2) Jokinen, P., Ukkonen, E.: Two Algorithms for Approximate String Matching
in Static Texts, Proceedings of the 16th Symposium on Mathematical
Foundations of Computer Science, number 520 in LNCS, Springer, 1991.
3) Erkki Sutinen and Jorma Tarhio. On using q-gram locations in approximate
string matching. Algorithms -- ESA '95, p. 327-340
4) Chen Li, Bin Wang, Xiaochun Yang. VGRAM: improving performance of
approximate queries on string collections using variable-length grams. VLDB
'07 Proceedings of the 33rd international conference on Very large data
bases
5) Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S.
Muthukrishnan, Divesh Srivastava. Approximate String Joins in a Database
(Almost) for Free. VLDB '01 Proceedings of the 27th International Conference
on Very Large Data Bases

----
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2011-04-04 12:07:00 Re: GUC assign hooks (was Re: wal_buffers = -1 and SIGHUP)
Previous Message Alexander Korotkov 2011-04-04 11:16:04 GSoC proposal: Fast GiST index build