Re: regex_fixed_prefix() is still a few bricks shy of a load

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: regex_fixed_prefix() is still a few bricks shy of a load
Date: 2012-07-08 22:39:30
Message-ID: 17440.1341787170@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Jul 7, 2012, at 1:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 3. Try another approach entirely. The idea that I've got in mind here
>> is to compile the regex using the regex library and then look at the
>> compiled NFA representation to see if there must be a fixed prefix.

> I think this is clearly the best way forward, probably even in the
> back branches. It's true that the wchar to mb conversion is largely
> untested, but it's also pretty simple code. Sure, it could have bugs,
> but so could whatever work-around you cobble together to avoid
> back-patching it. And it's not like we'll break anything else,
> either: the code will only be used in the case that is buggy right now
> anyway.

Attached is a draft patch for that. I'm fairly happy with the way this
turned out. The code is a bit longer than before, but most of the net
addition is boilerplate code associated with maintaining a well-defined
API for the regex library proper and the regex caching code in
utils/adt/regexp.c. The code that actually extracts the prefix string
(findprefix()) is about 150 lines, comparable to the net removal from
regex_fixed_prefix(), and it is *way* less heuristic: basically, given
the data structure it's working with, there is just one right answer.

One thing that remains slightly klugy is identification of the character
code associated with a colormap "color". The regex library's colormap
data structure is only designed to provide forward lookup from char
codes to colors; if you want to go the other way, the only possibility
is to serially probe each char value, which is untenable for non-Western
alphabets. What I did about this was to tweak the colormap building
code to remember just the first char code entered for each color. If,
after the dust settles, that's the only char belonging to the color,
we can use it --- otherwise, we just punt and stop extending the prefix
string. Given that we only care about doing reverse mapping for
singleton colors anyway, I believe that this is adequate. There are
cases where a color might have only one member but it isn't the first
one added, for example in "^abc[de]d". Here, d and e will be added to
a color for the bracket expression, and afterwards d will be split out
to its own subcolor, leaving e as the sole member of a color it wasn't
the first member of. But for our purposes it doesn't matter, because
both the d and e colors will be outarcs from the state after c, so
we couldn't extend the fixed prefix to include e anyway. There might
be some obscure corner cases where this implementation fails to
recognize a usable fixed-prefix character, but they have to be pretty
darn obscure.

Another loose end is that the API for regex_fixed_prefix() supposes
that it can return not only the fixed prefix string extracted from
the pattern, but also the "rest" of the pattern. There's no way to
reconstitute a "remaining pattern" in what I've added to backend/regex,
so in this patch regex_fixed_prefix() is just passing back the whole
pattern in the Pattern_Prefix_Partial case, which is likely to lead
to a lowball selectivity estimate when there's a long prefix.

Now the previous implementation of that is pretty darn brain-dead
anyway, because what it hands back is whatever's left when it stops
scanning. For instance, given "^(ab[de])xyz" it would extract the fixed
prefix "ab" and then return "rest" as "[de])xyz", which isn't a valid
regex. The only thing we are doing with that is passing it to
regex_selectivity(), which is too stupid to have a problem with it; but
obviously there is no chance of ever upgrading the logic without fixing
that somehow. (Note that right now we will never reach any of this code
anyway unless the target column has a small histogram; we realized long
ago that it was so bogus that relying on histogram_selectivity is a much
better idea if there's a reasonable amount of data...)

It's interesting to think about building something that would examine
the NFA representation of the regex, starting from whatever state
findprefix() stops at, and apply heuristic selectivity calculations
similar to what regex_selectivity() does. However I don't see where
to put such code while maintaining a reasonable API wall between PG's
selectivity estimators and the regex library. It's surely not something
that could be considered a general-purpose addition to a standalone
regex library.

What I have in mind to do for the moment is to refactor
regex_fixed_prefix() so that it doesn't hand back a "rest of pattern"
per se, but is directly responsible for handing back a selectivity
estimate for the rest of the pattern (a change we'd need to make anyway
if we ever did what's contemplated in the previous para). What it can
do is run regex_selectivity() over the whole pattern, and then divide
out FIXED_CHAR_SEL times the length of the fixed prefix, which should
compensate reasonably well for letting regex_selectivity() see the
prefix portion of the pattern.

One other point is that although this change adds regexp compilation
to the planner code path, that should be nearly free in common cases,
because we're caching the compiled regexp, thus saving having to compile
it at query runtime. I haven't tried to measure but I think the total
runtime should be pretty similar to what it was before.

Comments?

regards, tom lane

Attachment Content-Type Size
regexp-extract-prefix-1.patch text/x-patch 24.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-07-08 22:52:25 Re: Schema version management
Previous Message Peter Eisentraut 2012-07-08 19:31:32 Re: Schema version management