Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WIP: index support for regexp search
Date: 2013-04-08 12:56:32
Message-ID: CAPpHfdvdcOaVx1z0RiQFgzhcSwd7Pkd9Wg3H-C3EbM=McFw8Tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > [ trgm-regexp-0.15.patch.gz ]
>
> I spent the weekend hacking on this, making a number of bug fixes and a
> whole lot of cosmetic changes. I think there are large parts of this
> that are in committable shape now, but I still find the actual graph
> transformation logic to be mostly unintelligible. I think what's most
> obscure is the distinction between the arcs list and the keys list of
> each state in the expanded graph. I get the impression that the
> general idea is for the arcs to represent exactly-known transitions
> while the keys represent imprecisely-known transitions ... but there
> seems to be at least some leakage between those categories. Could
> you write down a specification for what's supposed to be happening
> there?
>

Here is my try to specify it.
At first some notions. I know, they are already in the comments, but just
in order to put it together.

Extended color - any color of source CNFA or one of two special values:
1) Unknown color - may represent any character either alphanumeric or
non-alphanumeric.
2) Blank color - may represent any non-alphanumeric character
Prefix is extended colors of last two characters read by CNFA.
Key is pair of CNFA state and prefix. So, key is a extended state which is
containing additional information which can influence further trigrams.

So, if you are in some key and traverse some CNFA arc then you moves info
another key. But there are two possible cases (or, sometimes, both of them):
1) Your move from one key into another necessary means read of some
trigram. Then you create new arc labeled with that trigram.
2) You move into another key, but you doesn't necessary read an useful
trigram. For example, arc of source CNFA is labeled by "unextractable"
color. Then you add new key into "keys" array. And outgoing arcs from this
key will also be processed similarly to source key. Therefore "keys" array
is a set of keys which are achievable from "stateKey" without reading of
useful trigram.
We could get rid of "keys" array and produce some "empty arcs" in the
second case. You can imagine that "keys" array is set of keys which are
achivable by "empty arcs" from "stateKey". States connected with "empty
arcs" could be merged on the next stage.
However, the reason of having separated addKeys stage is optimization. In
addKey function more generals keys absorb less general ones. In many cases
resulting graph becomes much simplier because of this. For example, if your
regex is not prefix, then engine puts self-referencing arcs of all possible
colors to initial state. Straight-forward processing of this could produce
enormous output graph. I had similar situation in early version of patch
where keys didn't absorb earch other.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2013-04-08 13:37:46 Re: commit dfda6ebaec67 versus wal_keep_segments
Previous Message Samrat Revagade 2013-04-08 10:34:21 Inconsistent DB data in Streaming Replication