Quick Links

Re: WIP: index support for regexp search

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-30 11:20:02
Message-ID:	CAPpHfdtpYFWFe8bRdmmvKSD39YM8UTeEzuMrKHi8Rr61CzzdtA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> One thing that bothers me with this algoritm is that the overflow
> mechanism is all-or-nothing. In many cases, even when there is a huge
> number of states in the diagram, you could still extract at least a few
> trigrams that must be present in any matching string, with little effort.
> At least, it seems like that to a human :-).
>
> For example, consider this:
>
> explain analyze select count(*) from azjunk4 where txt ~ ('^**
> aabaacaadaaeaafaagaahaaiaajaak**aalaamaanaaoaapaaqaaraasaataau**
> aavaawaaxaayaazabaabbabcabdabe**abfabgabhabiabjabkablabmabnabo**
> abpabqabrabsabtabuabvabwabxaby**abzacaacbaccacdaceacfacgachaci**
> acjackaclacmacnacoacpacqacracs**actacuacvacwacxacyaczadaadbadc**
> addadeadfadgadhadiadjadkadladm**adnadoadpadqadradsadtaduadvadw**
> adxadyadzaeaaebaecaedaeeaefaeg**aehaeiaejaekaelaemaenaeoaepaeq**
> aeraesaetaeuaevaewaexaeyaezafa**afbafcafdafeaffafgafhafiafjafk**
> aflafmafnafoafpafqafrafsaftafu**afvafwafxafyafzagaagbagcagdage**
> agfaggaghagiagjagkaglagmagnago**agpagqagragsagtaguagvagwagxagy**
> agzahaahbahcahdaheahfahgahhahi**ahjahkahlahmahnahoahpahqahrahs**$');
>
> you get a query plan like this (the long regexp string edited out):
>
> Aggregate (cost=228148.02..228148.03 rows=1 width=0) (actual
> time=131.100..131
> .101 rows=1 loops=1)
> -> Bitmap Heap Scan on azjunk4 (cost=228144.01..228148.02 rows=1
> width=0) (
> actual time=131.096..131.096 rows=0 loops=1)
> Recheck Cond: (txt ~ <ridiculously long regexp>)
> Rows Removed by Index Recheck: 10000
> -> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
> (cost=0.00..228144
> .01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)
> Index Cond: (txt ~ <ridiculously long regexp>)
> Total runtime: 131.230 ms
> (7 rows)
>
> That ridiculously long string exceeds the number of states (I think, could
> be number of paths or arcs too), and the algorithm gives up, resorting to
> scanning the whole index as can be seen by the "Rows Removed by Index
> Recheck" line. However, it's easy to see that any matching string must
> contain *any* of the possible trigrams the algorithm extracts. If it could
> safely return just a few of them, say "aab" and "abz", and discard the
> rest, that would already be much better than a full index scan.
>
> Would it be safe to simply stop short the depth-first search on overflow,
> and proceed with the graph that was constructed up to that point?

For depth-first it's not. But your proposal naturally makes sense. I've
changed it to breadth-first search. And then it's safe to mark all
unprocessed states as final when overflow. It means that we assume every
unprocessed branch to immediately finish with matching (this can give us
more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the
trigrams.
New version of patch is attached.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.7.patch.gz	application/x-gzip	15.0 KB

In response to

Re: WIP: index support for regexp search at 2012-11-29 13:25:22 from Heikki Linnakangas

Responses

Re: WIP: index support for regexp search at 2012-11-30 11:28:37 from Alexander Korotkov
Re: WIP: index support for regexp search at 2012-11-30 14:23:59 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alexander Korotkov	2012-11-30 11:22:05	Re: WIP: index support for regexp search
Previous Message	Dimitri Fontaine	2012-11-30 11:11:58	Re: Refactoring standby mode logic