Notes about fixing regexes and UTF-8 (yet again)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-15 23:06:36
Message-ID: 24241.1329347196@sss.pgh.pa.us
Lists: pgsql-hackers

In bug #6457 it's pointed out that we *still* don't have full
functionality for locale-dependent regexp behavior with UTF8 encoding.
The reason is that there's old crufty code in regc_locale.c that only
considers character codes up to 255 when searching for characters that
should be considered "letters", "digits", etc. We could fix that, for
some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
UTF8 encoding, but the time that would take is unappealing. Especially
so considering that this code is executed afresh anytime we compile a
regex that requires locale knowledge.
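
To make the cost concrete, here is a minimal sketch of the kind of probe
loop under discussion. It is not the code in regc_locale.c; the names
count_class_members and probe_fn are made up for illustration, and the
sketch only counts members rather than building a cvec.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical type: any libc classifier such as iswalpha or iswdigit. */
typedef int (*probe_fn)(wint_t);

/*
 * Ask libc about every code from 0 to max_code inclusive and count how
 * many it accepts.  The work is proportional to max_code, so going from
 * 255 to 0xFFFF means 256 times as many libc calls on each run.
 */
static size_t
count_class_members(probe_fn probe, unsigned long max_code)
{
    size_t nmembers = 0;

    for (unsigned long c = 0; c <= max_code; c++)
    {
        if (probe((wint_t) c))
            nmembers++;
    }
    return nmembers;
}
```

Calling count_class_members(iswalpha, 0xFF) makes 256 probes, while
count_class_members(iswalpha, 0xFFFF) makes 65536, and that price would
be paid again for each regex compilation that needs the class.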

I looked into the upstream Tcl code and observed that they deal with
this by having hard-wired tables of which Unicode code points are to be
considered letters etc. The tables are directly traceable to the
Unicode standard (they provide a script to regenerate them from files
available from unicode.org). Nonetheless, I do not find that approach
appealing, mainly because we'd be risking deviating from the libc locale
code's behavior within regexes when we follow it everywhere else.
It seems entirely likely to me that a particular locale setting might
consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables
as a guide to which code points are worth probing libc for. That is,
assume that a UTF8-based locale will never claim that some code is a
letter that unicode.org doesn't think is a letter. That would cut the
number of required probes by a pretty large factor.
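
A rough sketch of that compromise, under stated assumptions: the
code_range type, the demo_letter_candidates table, and probe_candidates
are all hypothetical names, and the table holds only a few hand-picked
ranges for illustration, not data generated from unicode.org files.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical type: any libc classifier such as iswalpha or iswupper. */
typedef int (*probe_fn)(wint_t);

/* Inclusive range of code points that Unicode calls letters. */
typedef struct
{
    unsigned long first;
    unsigned long last;
} code_range;

/* Illustrative only: a real table would be generated from Unicode data. */
static const code_range demo_letter_candidates[] = {
    {0x0041, 0x005A},           /* A-Z */
    {0x0061, 0x007A},           /* a-z */
    {0x00C0, 0x00D6}            /* Latin-1 letters before the multiply sign */
};

/*
 * Probe libc only for codes listed in the candidate table.  Returns how
 * many candidates libc accepted; *nprobes receives the number of libc
 * calls made, which is the table's total size rather than 0xFFFF.
 */
static size_t
probe_candidates(const code_range *ranges, size_t nranges,
                 probe_fn probe, size_t *nprobes)
{
    size_t accepted = 0;

    *nprobes = 0;
    for (size_t i = 0; i < nranges; i++)
    {
        for (unsigned long c = ranges[i].first; c <= ranges[i].last; c++)
        {
            (*nprobes)++;
            if (probe((wint_t) c))
                accepted++;
        }
    }
    return accepted;
}
```

libc keeps the final say: a candidate it rejects is simply left out, so
the result can be a subset of the Unicode table but never a superset.
For instance, probing the two ASCII ranges with iswupper makes 52 libc
calls and accepts only the 26 uppercase letters.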

The other thing that seems worth doing is to install some caching.
We could presumably assume that the behavior of iswupper() et al is
fixed for the duration of a database session, so that we only need to
run the probe loop once when first asked to create a cvec for a
particular category.
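
The caching idea could look roughly like the sketch below. All names
(char_class, class_cache, class_member_count, probe_loop_runs) are
hypothetical, the cached value is just a member count standing in for a
real cvec, and the probe range is capped at 0x7F to keep the example
small.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical type: any libc classifier such as iswalpha or iswupper. */
typedef int (*probe_fn)(wint_t);

enum char_class { CC_ALPHA, CC_DIGIT, CC_UPPER, CC_COUNT };

/* One slot per category; a real cache would hold the cvec itself. */
typedef struct
{
    int     valid;              /* has the probe loop run for this class? */
    size_t  nmembers;           /* stand-in for the cached cvec */
} class_cache_entry;

static class_cache_entry class_cache[CC_COUNT];
static unsigned long probe_loop_runs;   /* instrumentation for the demo */

static const probe_fn class_probes[CC_COUNT] = {
    iswalpha, iswdigit, iswupper
};

/* The expensive part: one libc call per code in 0..max_code. */
static size_t
run_probe_loop(probe_fn probe, unsigned long max_code)
{
    size_t nmembers = 0;

    probe_loop_runs++;
    for (unsigned long c = 0; c <= max_code; c++)
    {
        if (probe((wint_t) c))
            nmembers++;
    }
    return nmembers;
}

/*
 * Return the cached result for a category, running the probe loop only
 * on the first request.  This relies on the assumption that libc's
 * answers do not change during the session.
 */
static size_t
class_member_count(enum char_class cls)
{
    class_cache_entry *entry = &class_cache[cls];

    if (!entry->valid)
    {
        entry->nmembers = run_probe_loop(class_probes[cls], 0x7F);
        entry->valid = 1;
    }
    return entry->nmembers;
}
```

Repeated calls to class_member_count(CC_DIGIT) run the probe loop once;
every later request for the same category is just an array lookup.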

Thoughts, better ideas?

regards, tom lane
