Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2012-12-03 12:31:31
Message-ID: CAPpHfdshP6ituomYi0VUvEiN6pWsMcm+XztTjo7TeP1wUCXuVg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com>wrote:

> On 02.12.2012 20:19, Tom Lane wrote:
>
>> Alexander Korotkov<aekorotkov(at)gmail(dot)com> writes:
>>
>>> Nice idea to delay expanding colors to characters! Obviously, we should
>>> delay expanding inly alphanumerical characters. Because
>>> non-alphanumberical
>>> characters influence graph structure. Trying to implement...
>>>
>>
>> Uh, why would that be? Colors are colors. The regexp machinery doesn't
>> care whether they represent alphanumerics or not. (Or to be more
>> precise, if there is a situation where it makes a difference, separate
>> colors will have been created for each set of characters that need to be
>> distinguished.)
>>
>
> The regexp machinery doesn't care, but the trigrams that pg_trgm extracts
> only contain alphanumeric characters. So if by looking at the CNFA graph
> produced by the regexp machinery you conclude that any matching strings
> must contain three-letter sequences "%oo" and "#oo", you can just club them
> together into " oo" trigram.
>
> I think you can run a pre-processing step to the colors, and merge colors
> that are equivalent as far as trigrams are considered. For example, if you
> have a color that contains only character '%', and another that contains
> character '#', you can treat them as the same hcolor. You might then be
> able to simplify the CNFA. Actually, it would be even better if you could
> apply the pre-processing to the regexp before the regexp machinery turns it
> into a CNFA. Not sure how easy it would be to do such pre-processing.
>

Treating colors as same should be possible only for colors which has no
alphanumeric characters, because colors are non-overlapping. However, this
optimization could be significant in some cases.

BTW, why create the path matrix? You could check the "check" array of
> trigrams in the consistent function directly against the graph. Consistent
> should return true, iff there is a path through the graph following only
> arcs that contain trigrams present in the check array. Finding a path
> through a complex graph could be expensive, O(|E|), but if the path is
> complex, the path matrix would be large as well, and checking against a
> large matrix isn't exactly free either. It would allow you to avoid
> "overflows" caused by having too many paths through the graph.

Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for
search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram
while graph would contain much more.Path matrix is a method to get rid of
all of them.
2) If we use color trigrams then we need some criteria for which color
trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
from initial state to the final by unexpanded trigrams. It seems much
harder to do with graph than with matrix.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kohei KaiGai 2012-12-03 12:39:21 Re: [v9.3] writable foreign tables
Previous Message Andres Freund 2012-12-03 12:22:12 Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel