Re: Notes about fixing regexes and UTF-8 (yet again)

From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 03:38:31
Message-ID: CALDgxVtk41fkTcF+24b1DbytbwD=kO+K-HGbMyOwjT45TRRkiQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 04:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Yeah, it's conceivable that we could implement something whereby
> >> characters with codes above some cutoff point are handled via runtime
> >> calls to iswalpha() and friends, rather than being included in the
> >> statically-constructed DFA maps. The cutoff point could likely be a lot
> >> less than U+FFFF, too, thereby saving storage and map build time all
> >> round.
> >
> > In the meantime, I still think the caching logic is worth having, and
> > we could at least make some people happy if we selected a cutoff point
> > somewhere between U+FF and U+FFFF. I don't have any strong ideas about
> > what a good compromise cutoff would be. One possibility is U+7FF, which
> > corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> > if that corresponds to any significant dropoff in frequency of usage.
>
> The problem, of course, is that this probably depends quite a bit on
> what language you happen to be using. For some languages, it won't
> matter whether you cut it off at U+FF or U+7FF; while for others even
> U+FFFF might not be enough. So I think this is one of those cases
> where it's somewhat meaningless to talk about frequency of usage.
>

Does it make sense for regexps to have collations?
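
For readers following the quoted proposal, here is a minimal sketch of the cutoff idea: code points at or below a cutoff are answered from a precomputed map, and anything above it falls through to a runtime iswalpha() call. This is not PostgreSQL's regex code. The names CLASS_CUTOFF, alpha_map, build_alpha_map, is_alpha_cp and utf8_len are hypothetical, the cutoff of U+7FF is just the candidate value mentioned in the thread, and the sketch assumes a platform where wchar_t holds Unicode code points and the current locale drives iswalpha().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical cutoff: U+07FF, the candidate value discussed in the thread. */
#define CLASS_CUTOFF 0x7FFu

/* Precomputed map for code points <= CLASS_CUTOFF (stand-in for a static map). */
static bool alpha_map[CLASS_CUTOFF + 1];
static bool alpha_map_ready = false;

/* Fill the map once; the cost is proportional to the cutoff, not to U+FFFF. */
static void build_alpha_map(void)
{
    for (uint32_t cp = 0; cp <= CLASS_CUTOFF; cp++)
        alpha_map[cp] = iswalpha((wint_t) cp) != 0;
    alpha_map_ready = true;
}

/* Table lookup at or below the cutoff, runtime library call above it. */
static bool is_alpha_cp(uint32_t cp)
{
    if (cp <= CLASS_CUTOFF)
    {
        assert(alpha_map_ready);
        return alpha_map[cp];
    }
    /* Assumes wchar_t can represent cp as a Unicode code point. */
    return iswalpha((wint_t) cp) != 0;
}

/*
 * Bytes needed to encode a code point in UTF-8, or 0 if it is above U+10FFFF.
 * A 2-byte sequence carries 5 + 6 = 11 payload bits, hence the U+07FF limit.
 * The surrogate range is not rejected here, to keep the sketch short.
 */
static int utf8_len(uint32_t cp)
{
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}
```

A caller would run build_alpha_map() once after the locale is set and then use is_alpha_cp() for every character test. Moving CLASS_CUTOFF only changes how much of the range is paid for up front; results above the cutoff still depend entirely on what the platform's iswalpha() reports for the active locale, which is the language-dependence concern raised in the quoted text.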
