From: | Thom Brown <thom(at)linux(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | Vik Fearing <vik(at)postgresfriends(dot)org>, Joe Conway <mail(at)joeconway(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(at)eisentraut(dot)org> |
Subject: | Re: Add CASEFOLD() function. |
Date: | 2025-06-19 04:03:35 |
Message-ID: | CAA-aLv7KLoT9yCdiJwRP9PeL_4yNTzQ3T8WJLbTtX=Ld45UOpg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, 19 Jun 2025, 03:53 Jeff Davis, <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> > I don't know. I am just pointing out what the Standard says. I
> > think
> > we should either comply, or say that we don't do it for LOWER and
> > UPPER
> > so let's keep things implementation-consistent.
>
> For the standard, I see two potential philosophies:
>
> I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
> preserve NFC in the same way.
>
> II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
> text value that is useful for caseless matching, but should not
> ordinarily be used for display or sent to the application (those things
> would be allowed, just not encouraged). For normalization, either:
> (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
> don't require any kind of normalization; or
> (B) Follow Unicode Canonical Caseless Matching (D145), and require
> that the input and output are normalized appropriately, but leave the
> precise normal form as implementation-defined.
>
>
> The current implementation could either be seen as philosophy (I) where
> we've chosen to ignore the normalization part for the sake of
> consistency with LOWER()/UPPER(); or it could be seen as philosophy
> (II)(A).
>
> > How much does it cost to check for NFC? I honestly don't know the
> > answer to that question, but that is the only case where we need to
> > maintain normalization.
>
> I attached a very rough patch and ran a very simple test on strings
> averaging 36 bytes in length, all already in NFC and the result is also
> NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
> seconds, afterward about 8.
>
> There's a patch to optimize some of the normalization paths, which I
> haven't had a chance to review yet. So those numbers might come down.
>
> >
> > It's not unconditionally, it's only if the input was NFC.
>
> Optimizing the case where the input is _not_ NFC seems strange to me.
> If we are normalizing the output, I'd say we should just make the
> output always NFC. Being more strict, this seems likely to comply with
> the eventual standard.
>
> Additionally, if we are normalizing the output, then we should also do
> the input fixup for U+0345, which would make the result usable for
> Canonical Caseless Matching. Again, this seems likely to comply with
> the eventual standard.
>
> >
>
> So I only see two reasonable implementations:
>
> 1. The current CASEFOLD() implementation.
>
> 2. Do the input fixup for U+0345 and unconditionally normalize the
> output in NFC.
>
> If there's a case to be made for both implementations, we could also
> consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
> for #2. I'm not sure whether we'd want to standardize one or both of
> those functions.
>
> And if you think there's likely to be a collision with the standard
> that's hard to anticipate and fix now, then we should consider
> reverting CASEFOLD() for 18 and wait for more progress on the
> standardization. What's the likelihood that the name changes or
> something like that?
>
Late to the party, but is there an argument for porting this to the citext
type? Or for supplementing the extension with an additional type ("cftext"?
*shrug*). citext currently uses lower(), so our current recommendation for
handling all Unicode characters is to use nondeterministic collations.
Thom
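
For readers following the thread, here is a minimal sketch of the distinctions discussed above. It is not PostgreSQL code: Python's str.casefold() and unicodedata.normalize() stand in for CASEFOLD() and normalization, and the helper names default_caseless_match and canonical_caseless_match are made up for illustration. It shows why a lower()-based comparison (as citext uses) can miss matches that case folding finds, and how Default Caseless Matching (D144) differs from Canonical Caseless Matching (D145).

```python
import unicodedata


def default_caseless_match(x: str, y: str) -> bool:
    """Default Caseless Matching (D144): compare the case-folded strings,
    with no normalization step."""
    return x.casefold() == y.casefold()


def canonical_caseless_match(x: str, y: str) -> bool:
    """Canonical Caseless Matching (D145): normalize to NFD, case-fold,
    then normalize to NFD again before comparing."""
    def key(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return key(x) == key(y)


# A lower()-based comparison treats these as different strings,
# while full case folding maps both to "strasse".
lower_equal = "Straße".lower() == "STRASSE".lower()
folded_equal = default_caseless_match("Straße", "STRASSE")

# The same word in precomposed (NFC) and decomposed (NFD) form.
composed = "\u00c5ngstr\u00f6m"
decomposed = unicodedata.normalize("NFD", composed)

# Without normalization the two forms fold to different code point
# sequences; with NFD applied before and after folding they compare equal.
default_equal = default_caseless_match(composed, decomposed)
canonical_equal = canonical_caseless_match(composed, decomposed)

print(lower_equal, folded_equal, default_equal, canonical_equal)
```

The second comparison is the normalization question in miniature: option (II)(A) corresponds to the behaviour of default_caseless_match, while option 2 (input fixup plus normalized output) aims at making the result usable the way canonical_caseless_match is.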