Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-01-22 23:33:54
Message-ID: 8f105bd641a2fcbddc7c5f0c2ce60731a70da0de.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote:

> >
> I don't get this argument.  Of course, people care about sorting and
> sort order.  Whether you consider this part of Unicode or adjacent to
> it, people still want it.

You said that my proposal sends a message that we somehow don't care
about Unicode, and I strongly disagree. The built-in provider I'm
proposing does implement Unicode semantics.

Surely a database that offers UCS_BASIC (a SQL spec feature) isn't
sending a message that it doesn't care about Unicode, and neither is my
proposal.

> >
> > * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive
> > matching, and customization with rules. It's the solution for
> > everything from "slightly more advanced" to "very advanced".
>
> I am astonished by this.  In your world, do users not want their text
> data sorted?  Do they not care what the sort order is? 

I obviously care about Unicode and collation. I've put a lot of effort
recently into contributions in this area, and I wouldn't have done that
if I thought users didn't care. You've made much greater contributions
and I thank you for that.

The logical conclusion of your line of argument would be that libc's
"C.UTF-8" locale and UCS_BASIC simply should not exist. But they do
exist, and for good reason.

One of those good reasons is that only *human* users care about the
human-friendliness of sort order. If Postgres is just feeding the
results to another system -- or an application layer that re-sorts the
data anyway -- then stability, performance, and interoperability matter
more than human-friendliness. (Though Unicode character semantics are
still useful even when the data is not going directly to a human.)

> You consider UCA
> sort order an "advanced" feature?

I said "slightly more advanced" compared with "basic". "Advanced" can
be taken in either a positive way ("more useful") or a negative way
("complex"). I'm sorry for the misunderstanding, but my point was this:

* The builtin provider is for people who are fine with code point order
and no tailoring, but want Unicode character semantics, collation
stability, and performance.

* ICU is the right solution for anyone who wants human-friendly
collation or tailoring, and is willing to put up with some collation
stability risk and lower collation performance.

Both have their place and the user is free to mix and match as needed,
thanks to the COLLATE clause for columns and queries.

Regards,
Jeff Davis

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message James Coleman 2024-01-23 00:29:30 Re: Add last_commit_lsn to pg_stat_database
Previous Message Andrew Dunstan 2024-01-22 23:01:20 Re: WIP Incremental JSON Parser