Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2012-12-02 18:06:22
Message-ID: CAPpHfdsTEGmx-fmBzaqFED8ghiDX+UA0duph=UL383w80A=oDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 30.11.2012 13:20, Alexander Korotkov wrote:
>
>> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas(at)**
>> vmware.com <hlinnakangas(at)vmware(dot)com>
>>
>>> wrote:
>>>
>>
>> Would it be safe to simply stop short the depth-first search on overflow,
>>> and proceed with the graph that was constructed up to that point?
>>>
>>
>> For depth-first it's not. But your proposal naturally makes sense. I've
>> changed it to breadth-first search. And then it's safe to mark all
>> unprocessed states as final when overflow. It means that we assume every
>> unprocessed branch to immediately finish with matching (this can give us
>> more false positives but no false negatives).
>> For overflow of matrix collection, it's safe to do just OR between all the
>> trigrams.
>> New version of patch is attached.
>>
>
> Thanks, sounds good.
>
> I've spent quite a long time trying to understand the transformation the
> getState/addKeys/addAcrs functions do to the original CNFA graph. I think
> that still needs more comments to explain the steps involved in it.
>
> One thing that occurs to me is that it might be better to delay expanding
> colors to characters. You could build and transform the graph, and even
> create the paths, all while operating on colors. You would end up with
> lists of "color trigrams", consisting of sequences of three colors that
> must appear in the source string. Only at the last step you would expand
> the color trigrams to real character trigrams. I think that would save a
> lot of processing while building the graph, if you have colors that contain
> many characters. It would allow us to do better than the fixed small
> MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current
> patch, it's a shame that we can't do anything with a fairly simple regexp
> like "^a[b-g]h$"
>

Nice idea to delay expanding colors to characters! Obviously, we should
delay expanding inly alphanumerical characters. Because non-alphanumberical
characters influence graph structure. Trying to implement...

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2012-12-02 18:07:48 Re: WIP: index support for regexp search
Previous Message Andrew Dunstan 2012-12-02 17:36:32 Re: proposal: separate databases for contrib module testing