Re: Better locale-specific-character-class handling for regexps

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Bruno Wolff III <bruno(at)wolff(dot)to>
Subject: Re: Better locale-specific-character-class handling for regexps
Date: 2016-09-04 17:44:06
Message-ID: 20763.1473011046@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On 08/23/2016 03:54 AM, Tom Lane wrote:
>> ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
>> ! whose rows correspond to character ranges that are explicitly mentioned
>> ! in the input, and whose columns correspond to sets of relevant locale
>> ! character classes.

> I think that last sentence should say "whose rows correspond to
> character ranges that are explicitly mentioned in the *regex pattern*",
> rather than "in the input".

OK. The pattern is the input under consideration by regcomp.c, but I see
your point that someone might be confused.

> An example would be very helpful here.

I'll see what I can do.

> I can easily come up with examples for
> character classes, but not for "high" character-ranges. The best I could
> come up with is to check if a characters belongs to some special group
> of unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical
> symbol characters.

Right, or someone might write the equivalent of [A-Z] in the Klingon
alphabet, or whatever --- it wouldn't necessarily have to use backslash
notation, it could be literal characters with large codes.

> In practice, I guess you will only see single
> characters in the colormaprange array, although we must of course cope
> with ranges too.

I suspect that's true; I'm not sure that there are many places in the
upper reaches of Unicode space where ranges of consecutive character
codes form natural units to search for.

>> + /* this relies on WHITE being zero: */
>> + memset(cm->locolormap, WHITE,
>> + (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));

> Is the comment correct? I don't see why this wouldn't work with "WHITE
> != 0".

The array entries are shorts not chars, so if WHITE were say 2, memset
would fill the array with 0x0202 rather than 2. I suppose if WHITE were
say 0x0303 then it would accidentally work, but I don't think the comment
needs to get into that.

> Perhaps "backwards" would be clearer than "downwards"?

OK.

> +1 for this patch in general. Some regression test cases would be nice.

I'm not sure how to write such tests without introducing insurmountable
platform dependencies --- particularly on platforms with weak support for
UTF8 locales, such as OS X. All the interesting cases require knowing
what iswalpha() etc will return for some high character codes.

What I did to test it during development was to set MAX_SIMPLE_CHR to
something in the ASCII range, so that the high-character-code paths could
be tested without making any assumptions about locale classifications for
non-ASCII characters. I'm not sure that's a helpful idea for regression
testing purposes, though.

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that. AFAIK that test file does not
get run at all in the buildfarm or in the wild.

Thanks for reviewing!

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Emre Hasegeli 2016-09-04 17:59:41 Re: [PATCH] Alter or rename enum value
Previous Message Emre Hasegeli 2016-09-04 17:42:33 Re: Floating point comparison inconsistencies of the geometric types