Re: Collation version tracking for macOS

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeremy Schneider <schneider(at)ardentperf(dot)com>
Cc: Jim Nasby <nasbyj(at)amazon(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation version tracking for macOS
Date: 2022-06-12 22:36:02
Message-ID: CA+hUKGLgzQU=zJneV4ASAmVDXyFPB6CY5MKMHx6trQJaFGGREw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey Jeremy,

On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider
<schneider(at)ardentperf(dot)com> wrote:
> Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material; I haven’t read the whole thread yet. If you have some others that would also be particularly good background, let me know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I was pulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years, so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in the conversation here.

There were more threads, but they mostly say the same things, hence my
current attempt to move from bloviation to trying out the ideas with
actual code :-D

> Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn on version mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is pretty damn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things to happen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades of large & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I have a lot of reading to do, to really understand how that happened and where the dialogue is today.

Given that the only thing you could do about it is REINDEX, and yet we
don't even know which indexes needed to be REINDEXed (the problem
Julien and I tried to address, but so far without success), it seemed
highly premature to convert the warning to an error.

I don't think the community consensus is that we have arrived
somewhere, it's more like we're in transit, possibly without a map.
For example 15 gains ICU support for the default collation (= how most
people consume collations), which changes things, and will surely lead
to more people thinking about this problem space.

> Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but I still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?)

Hmm. Well I personally wouldn't try to write a collator any more
willingly than I'd try to write a new cryptographic algorithm, just
not my bag. We don't want to handle complaints about our sort order
(we already bat away complaints about glibc's, and I heard an account
from an OS vendor about the non-stop contradictory crank complaints
about sort order they used to receive before they gave up and adopted
ICU). ICU really is quite authoritative here.

If you mean that you don't even want to have to specify a language
like "en", then note that you don't have to: ICU has a "root" collator
which you can request with an empty string (all other collators apply
cultural tweaks on top of that). Like everything else, the root
collator has changed over time, though.

With my "distinct" experimental patch (v4), you could set your
database default collation to a specific ICU major version's root
collator like so:

create database ... locale_provider = icu icu_locale = '71:' template
= template0

That'll keep working, even across pg_upgrades to some version of
PostgreSQL far in the future that is linked against ICU 100, by
dlopen'ing the .71 libraries, for as long as you can get your hands on
a libicu71 package or otherwise build your own, and it'll error out if
it can't open that library, which may be the hard error you were
looking for. If there's an API change in ICU we'll have to make some
changes, but that's already true.

Review/testing/flames/rants/better ideas welcome.

> Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for the non-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>).

+1

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-06-13 02:32:13 Re: Add header support to text format and matching feature
Previous Message Andrew Dunstan 2022-06-12 21:58:54 Re: Improve TAP tests of pg_upgrade for cross-version tests