Re: Notes about fixing regexes and UTF-8 (yet again)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:49:32
Message-ID: CA+TgmoYs86AawgrUnEfE=HsMGm8r2dN-x-2W-eL-8dUurQD7cQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> In theory you can imagine a regular expression engine where these
>> decisions can be postponed until we see the string we're matching
>> against.  IOW, your DFA ends up with state transitions for characters
>> specifically named, plus a state transition for "anything else that's
>> a letter", plus a state transition for "anything else not otherwise
>> specified".  Then you only need to test the letters that actually
>> appear in the target string, rather than all of the ones that might
>> appear there.
>
>> But implementing that could be quite a lot of work.
>
> Yeah, not to mention slow.  The difficulty is overlapping sets of
> characters.  As a simple example, if your regex refers to 3, 7,
> [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
> "colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
> that aren't any of the preceding.  And state transitions for the digit
> and alnum cases had better mention all and only the correct colors.

Yeah, that's unfortunate. On the other hand, if you don't use colors
for this case, aren't you going to need, for each DFA state, a
gigantic lookup table that includes every character in the server
encoding? Even if you've got plenty of memory, initializing such a
beast seems awfully expensive, and it might not do very good things
for cache locality, either.

> I've been tracing through the logic this evening, and it works pretty
> simply given that all named character classes are immediately expanded
> out to their component characters.  If we are going to try to keep
> the classes in some kind of symbolic form, it's a lot messier.  In
> particular, I think your sketch above would lead to having to test
> every character against iswdigit and iswalnum at runtime, which would
> be disastrous performancewise.  I'd like to at least avoid that for the
> shorter (and presumably more common) UTF8 codes.

Hmm, but you could cache that information. Instead of building a
cache that covers every possible character that might appear in the
target string, you can just cache the results for the code points that
you actually see.

Yet another option would be to dictate that the cache can't holes - it
will always include information for every code point from 0 up to some
value X. If we see a code point in the target string which is greater
than X, then we extend the cache out as far as that code point. That
way, people who are using only code points out to U+FF (or even U+7F)
don't pay the cost of building a large cache, but people who need it
can get correct behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-19 05:24:12 wal_buffers
Previous Message Tom Lane 2012-02-19 04:49:10 Re: Future of our regular expression code