Re: Does UCS_BASIC have the right CTYPE?

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)postgresql(dot)org
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>
Subject: Re: Does UCS_BASIC have the right CTYPE?
Date: 2023-10-26 18:42:27
Message-ID: 70b79878856d4f2cabe67fb8e3420a92ea641214.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2023-10-26 at 09:21 -0700, Jeff Davis wrote:
> Our initcap() is not defined in the standard, and we document that it
> only differentiates between alphanumeric and non-alphanumeric
> characters, so we could get that behavior pretty easily as well. If
> we
> wanted to do it the Unicode way instead, we can follow the
> toTitlecase() part of the Default Case Algorithm, which is based on
> word breaks and would require another lookup table for that.

Correction: the rules for word breaks are fairly complex, so it would
not be worth it to try to replicate that just to support initcap(). We
could just use the simple, existing, and documented rules for initcap()
which only differentiate between alphanumeric and not. Anyone who wants
the more sophisticated rules can just use an ICU collation with
initcap().

The point stands that it would be pretty simple to have a collation
that handles upper() and lower() in a standards-compliant way without
relying on libc or ICU. Unfortunately it's too late to call that
collation UCS_BASIC, but it would still be useful.

Regards,
Jeff Davis

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2023-10-26 19:43:29 Re: Partial aggregates pushdown
Previous Message Andres Freund 2023-10-26 17:41:36 Re: visibility of open cursors in pg_stat_activity