Re: Better locale-specific-character-class handling for regexps

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Bruno Wolff III <bruno(at)wolff(dot)to>
Subject: Re: Better locale-specific-character-class handling for regexps
Date: 2016-09-05 07:05:29
Message-ID: e2d076ae-4685-f164-5a4a-05e7a0918793@iki.fi
Lists: pgsql-hackers

On 09/04/2016 08:44 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> On 08/23/2016 03:54 AM, Tom Lane wrote:
>> +1 for this patch in general. Some regression test cases would be nice.
>
> I'm not sure how to write such tests without introducing insurmountable
> platform dependencies --- particularly on platforms with weak support for
> UTF8 locales, such as OS X. All the interesting cases require knowing
> what iswalpha() etc will return for some high character codes.
>
> What I did to test it during development was to set MAX_SIMPLE_CHR to
> something in the ASCII range, so that the high-character-code paths could
> be tested without making any assumptions about locale classifications for
> non-ASCII characters. I'm not sure that's a helpful idea for regression
> testing purposes, though.
>
> I guess I could follow the lead of collate.linux.utf8.sql and produce
> a test that's only promised to pass on one platform with one encoding,
> but I'm not terribly excited by that. AFAIK that test file does not
> get run at all in the buildfarm or in the wild.

I'm not too worried if the tests don't get run regularly, but I don't
like the idea of a test that only works on one platform. This code is
unlikely to be accidentally broken by unrelated changes in the backend,
as the regexp code is very well isolated. But for someone who hacks on
the regexp library in the future, having a test suite to tickle all
these corner-cases would be very handy.

Another class of regressions would be that something changes in the way
a locale treats some characters, and that breaks an application. That
would be very difficult to test for in a platform-independent way. That
wouldn't really be our bug, though, but the locale's.

Since we're now de facto maintainers of this regexp library, and our
version could be used somewhere else than PostgreSQL too, it would
actually be nice to have a regression suite that's independent from the
pg_regress infrastructure, and wouldn't need a server to run. Perhaps a
stand-alone C program that compiles the regexp code with mock versions
of the pg_wc_is* functions. Or perhaps a magic collation OID that makes
the pg_wc_is* functions return hard-coded values for particular inputs.
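
To make the mock idea a bit more concrete, here is a rough sketch of
what such stand-ins might look like. Everything in it is an assumption
for illustration only: the pg_wchar typedef, the function signatures,
the table contents, and the count_matching() helper, which merely stands
in for the regexp code that would really be compiled against the mocks.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for the backend's wide-character typedef. */
typedef unsigned int pg_wchar;

/* Hard-coded classification for a few code points (illustrative values). */
typedef struct
{
    pg_wchar    c;
    int         is_alpha;
    int         is_digit;
} mock_class;

static const mock_class mock_table[] = {
    {0x0041, 1, 0},             /* LATIN CAPITAL LETTER A */
    {0x0037, 0, 1},             /* DIGIT SEVEN */
    {0x00E9, 1, 0},             /* LATIN SMALL LETTER E WITH ACUTE */
    {0x0663, 0, 1},             /* ARABIC-INDIC DIGIT THREE */
    {0x4E2D, 1, 0},             /* a CJK ideograph */
    {0x2603, 0, 0},             /* SNOWMAN */
};

/* Return the table entry for c, or NULL if c is not listed. */
static const mock_class *
mock_lookup(pg_wchar c)
{
    size_t      i;

    for (i = 0; i < sizeof(mock_table) / sizeof(mock_table[0]); i++)
    {
        if (mock_table[i].c == c)
            return &mock_table[i];
    }
    return NULL;
}

/* Mock classifiers: unlisted characters belong to no class. */
int
pg_wc_isalpha(pg_wchar c)
{
    const mock_class *m = mock_lookup(c);

    return m ? m->is_alpha : 0;
}

int
pg_wc_isdigit(pg_wchar c)
{
    const mock_class *m = mock_lookup(c);

    return m ? m->is_digit : 0;
}

int
pg_wc_isalnum(pg_wchar c)
{
    return pg_wc_isalpha(c) || pg_wc_isdigit(c);
}

/* Hypothetical placeholder for the code under test: count class members. */
static size_t
count_matching(const pg_wchar *s, size_t n, int (*pred) (pg_wchar))
{
    size_t      i;
    size_t      hits = 0;

    assert(pred != NULL);
    for (i = 0; i < n; i++)
    {
        if (pred(s[i]))
            hits++;
    }
    return hits;
}

/* Run platform-independent checks; returns the number of failures. */
int
run_mock_checks(void)
{
    const pg_wchar sample[] = {0x0041, 0x0663, 0x4E2D, 0x2603, 0x00E9};
    size_t      n = sizeof(sample) / sizeof(sample[0]);
    int         failures = 0;

    if (count_matching(sample, n, pg_wc_isalpha) != 3)
        failures++;
    if (count_matching(sample, n, pg_wc_isdigit) != 1)
        failures++;
    if (count_matching(sample, n, pg_wc_isalnum) != 4)
        failures++;
    return failures;
}
```

Because the classifications come from a fixed table rather than from
iswalpha() and friends, the results would be the same on every platform
and in every locale. A small main() that calls run_mock_checks() and
exits non-zero on failure would be enough to drive it.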

- Heikki
