Does UCS_BASIC have the right CTYPE?

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>
Subject:	Does UCS_BASIC have the right CTYPE?
Date:	2023-10-25 18:32:02
Message-ID:	20d61f835afe7de89df0b038aa7fe799c53cf634.camel@j-davis.com
Lists:	pgsql-hackers

UCS_BASIC is defined in the standard as a collation based on comparing
the code point values, and in UTF8 that is satisfied with memcmp(), so
the collation locale for UCS_BASIC in Postgres is simply "C".
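The equivalence between code point order and bytewise order in UTF-8 can be checked with a minimal sketch. This is Python rather than Postgres code, and the sample strings are arbitrary choices for illustration; Python's `bytes` comparison stands in for memcmp() followed by a length tie-break.

```python
# Sample strings spanning 1-, 2-, 3-, and 4-byte UTF-8 sequences
# (arbitrary illustrative choices, not taken from the original message).
samples = ["a", "Z", "\u00e1", "\u00c1", "z", "\u20ac",
           "\U0001F600", "ab", "\uffff", "\U00010000"]

# Order by the sequence of code point values.
by_code_point = sorted(samples, key=lambda s: [ord(ch) for ch in s])

# Order by the raw UTF-8 bytes, which is what a bytewise comparison sees.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# The two orderings agree for these samples.
print(by_code_point == by_utf8_bytes)
```

The agreement follows from the UTF-8 design: larger code points get encodings that compare greater byte by byte, so no decoding is needed to compare by code point.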

But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
Postgres, the answer is 'á', but intuitively, one could reasonably
expect the answer to be 'Á'.
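The two candidate answers can be contrasted with a small sketch. It is Python, not Postgres; the helper names `upper_c_locale` and `upper_unicode_default` are hypothetical, and the first only approximates "C" ctype behavior by assuming that just ASCII a-z are mapped.

```python
def upper_c_locale(s: str) -> str:
    """Uppercase only ASCII a-z, leaving every other character unchanged.

    Assumption: this approximates ctype "C" behavior for illustration.
    """
    return "".join(chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in s)


def upper_unicode_default(s: str) -> str:
    """Uppercase using Python's built-in, locale-independent Unicode mapping."""
    return s.upper()


word = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE
print(upper_c_locale(word))         # unchanged: not an ASCII letter
print(upper_unicode_default(word))  # LATIN CAPITAL LETTER A WITH ACUTE
```

Under the ASCII-only rule the character passes through untouched, which matches the 'á' result described above; the Unicode mapping yields 'Á'.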

Reading the standard, it seems that LOWER()/UPPER() are defined in
terms of the Unicode General Category (Section 4.2, "<fold> is a pair
of functions..."). It is somewhat ambiguous about the case mappings,
but I could guess that it means the Default Case Algorithm[1].
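To make the General Category reference concrete, here is a hedged sketch using Python's `unicodedata` module. It only inspects the Unicode Character Database version bundled with the running Python interpreter, and it assumes that `str.upper()` is an acceptable stand-in for the default case conversion; neither is a statement about what the SQL standard or Postgres does.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (General Category, character name) for a single code point."""
    return unicodedata.category(ch), unicodedata.name(ch)


lower = "\u00e1"       # the character from the example above
upper = lower.upper()  # assumed stand-in for the default uppercase mapping

print(describe(lower))  # category "Ll" (Letter, lowercase)
print(describe(upper))  # category "Lu" (Letter, uppercase)
```

The lowercase input has General Category Ll and its mapped result has Lu, which is the kind of property-based definition the quoted passage of the standard appears to rely on.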

That seems to suggest the standard answer should be 'Á' regardless of
any COLLATE clause (though I could be misreading). I'm a bit confused
by that... what's the standard-compatible way to specify the locale for
UPPER()/LOWER()? If there is none, then it makes sense that Postgres
overloads the COLLATE clause for that purpose so that users can use a
different locale if they want.

But given that UCS_BASIC is a collation specified in the standard,
shouldn't it have ctype behavior that's as close to the standard as
possible?

Regards,
Jeff Davis

[1] https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992
