Re: Notes about fixing regexes and UTF-8 (yet again)

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 08:48:50
Message-ID: 4F3E1472.6080403@enterprisedb.com
Lists: pgsql-hackers

On 16.02.2012 01:06, Tom Lane wrote:
> In bug #6457 it's pointed out that we *still* don't have full
> functionality for locale-dependent regexp behavior with UTF8 encoding.
> The reason is that there's old crufty code in regc_locale.c that only
> considers character codes up to 255 when searching for characters that
> should be considered "letters", "digits", etc. We could fix that, for
> some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
> UTF8 encoding, but the time that would take is unappealing. Especially
> so considering that this code is executed afresh anytime we compile a
> regex that requires locale knowledge.
>
> I looked into the upstream Tcl code and observed that they deal with
> this by having hard-wired tables of which Unicode code points are to be
> considered letters etc. The tables are directly traceable to the
> Unicode standard (they provide a script to regenerate them from files
> available from unicode.org). Nonetheless, I do not find that approach
> appealing, mainly because we'd be risking deviating from the libc locale
> code's behavior within regexes when we follow it everywhere else.
> It seems entirely likely to me that a particular locale setting might
> consider only some of what Unicode says are letters to be letters.
>
> However, we could possibly compromise by using Unicode-derived tables
> as a guide to which code points are worth probing libc for. That is,
> assume that a utf8-based locale will never claim that some code is a
> letter that unicode.org doesn't think is a letter. That would cut the
> number of required probes by a pretty large factor.
>
> The other thing that seems worth doing is to install some caching.
> We could presumably assume that the behavior of iswupper() et al are
> fixed for the duration of a database session, so that we only need to
> run the probe loop once when first asked to create a cvec for a
> particular category.
>
> Thoughts, better ideas?

Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF. After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table. Then recompile
the whole regex and restart the matching engine.

Recompiling is expensive, but if you cache the results for the session,
it would probably be acceptable.
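
To make the idea concrete, here is a rough sketch in C of what such a lazily filled table could look like. It is not taken from regc_locale.c: the names (cp_cache, cp_cache_lookup, probe_libc, the CLS_* bits), the fixed table size, and the choice of open addressing are all made up for illustration, and it assumes wint_t is wide enough to hold a full code point. A lookup reports through "added" whether it had to probe libc for a code point it had not seen before, which is the point where the regex would have to be recompiled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <wctype.h>

/* Hypothetical class bits, for illustration only. */
enum { CLS_ALPHA = 1, CLS_DIGIT = 2, CLS_UPPER = 4, CLS_LOWER = 8, CLS_SPACE = 16 };

#define CACHE_SLOTS   (1u << 18)   /* power of two, well above the preloaded 0x10000 entries */
#define PRELOAD_LIMIT 0xFFFFu

typedef struct { uint32_t key_plus1; uint32_t cls; } cp_slot;  /* key_plus1 == 0 means empty */
typedef struct { cp_slot *slots; uint32_t used; } cp_cache;

/* Ask libc for the class of one code point (assumes wint_t can hold it). */
static uint32_t probe_libc(uint32_t cp)
{
    wint_t   wc = (wint_t) cp;
    uint32_t cls = 0;

    if (iswalpha(wc)) cls |= CLS_ALPHA;
    if (iswdigit(wc)) cls |= CLS_DIGIT;
    if (iswupper(wc)) cls |= CLS_UPPER;
    if (iswlower(wc)) cls |= CLS_LOWER;
    if (iswspace(wc)) cls |= CLS_SPACE;
    return cls;
}

/* Linear probing: return the slot holding cp, or the empty slot where it belongs. */
static cp_slot *find_slot(cp_cache *cache, uint32_t cp)
{
    uint32_t i = (cp * 2654435761u) & (CACHE_SLOTS - 1);

    while (cache->slots[i].key_plus1 != 0 && cache->slots[i].key_plus1 != cp + 1)
        i = (i + 1) & (CACHE_SLOTS - 1);
    return &cache->slots[i];
}

/*
 * Return the class of cp. *added is set to true when cp was not cached yet
 * and has just been inserted; the caller would then recompile the regex.
 */
static uint32_t cp_cache_lookup(cp_cache *cache, uint32_t cp, bool *added)
{
    cp_slot *slot;

    assert(cache != NULL && added != NULL && cp <= 0x10FFFF);
    slot = find_slot(cache, cp);
    *added = false;
    if (slot->key_plus1 == 0)
    {
        uint32_t cls = probe_libc(cp);

        /* Sketch only: a real table would grow here instead of giving up. */
        if (cache->used >= CACHE_SLOTS / 4 * 3)
            return cls;
        slot->key_plus1 = cp + 1;
        slot->cls = cls;
        cache->used++;
        *added = true;
    }
    return slot->cls;
}

/* Build the session-lifetime cache, preloaded with code points 0..0xFFFF. */
static cp_cache *cp_cache_create(void)
{
    cp_cache *cache = malloc(sizeof *cache);
    bool      added;

    if (cache == NULL)
        return NULL;
    cache->used = 0;
    cache->slots = calloc(CACHE_SLOTS, sizeof(cp_slot));
    if (cache->slots == NULL)
    {
        free(cache);
        return NULL;
    }
    for (uint32_t cp = 0; cp <= PRELOAD_LIMIT; cp++)
        (void) cp_cache_lookup(cache, cp, &added);
    return cache;
}

static void cp_cache_free(cp_cache *cache)
{
    if (cache != NULL)
    {
        free(cache->slots);
        free(cache);
    }
}
```

The matcher would call cp_cache_lookup() for each input character; only when "added" comes back true does it need to throw away the compiled regex and start over, and since the table lives for the whole session that should happen at most once per new code point.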

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
