Re: Collation version tracking for macOS

From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation version tracking for macOS
Date: 2022-06-03 19:13:33
Message-ID: 1874de62-6bec-4bc1-1d14-0a2730b125da@ardentperf.com
Lists: pgsql-hackers

On 6/3/22 9:21 AM, Tom Lane wrote:
>
> According to that document, they changed it in macOS 11, which came out
> a year and a half ago. Given the lack of complaints, it doesn't seem
> like this is urgent enough to mandate a post-beta change that would
> have lots of downside (namely, false-positive warnings for every other
> macOS update).

Sorry, I'm going to rant for a minute... it is my very strong opinion
that using language like "false positive" here is misguided and dangerous.

If a new version of a sort order is released, for example when they
recently updated backwards-secondary sorting in French [CLDR-2905] or
matching of v and w in Swedish and Finnish [CLDR-7088], it is very
dangerous to use language like "false positive" to describe a database
where there just didn't happen to be any rows with accented French
characters at the point in time where PostgreSQL magically changed which
version of sort order it was using from the 2010 French version to the
2020 French version.

No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of French or German ICU sorting they want to use.
Different versions are treated as different sort orders; they are not
conflated.

I have personally seen PostgreSQL databases where an update to an old
version of glibc was applied (I'm not even talking 2.28 here) and it
resulted in data loss b/c crash recovery couldn't replay WAL records and
the user had to do a PITR. That's aside from the more common issues of
segfaults or duplicate records that violate unique constraints or wrong
query results like missing data. And it's not just updates - people can
set up a hot standby on a different version and see many of these
problems too.

Collation versioning absolutely must be first class and directly
controlled by users, and it's very dangerous to allow users - at all -
to take an index and then use a different version than what the index
was built with.
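
To make the index hazard concrete, here is a minimal Python sketch of a
sorted "index" that is built under one ordering and then searched under
another. The two key functions are toy stand-ins loosely modelled on the
v/w change mentioned above; they are assumptions for illustration and do
not reproduce the actual CLDR rules, and all names (old_key, new_key,
build_index, index_lookup) are hypothetical.

```python
from bisect import bisect_left

def old_key(s):
    # Toy "old" ordering (assumption): w is treated like v at the primary
    # level, with the raw string as a tie-breaker.
    return (s.replace("w", "v"), s)

def new_key(s):
    # Toy "new" ordering (assumption): w is a distinct letter, so plain
    # code point order applies.
    return s

def build_index(rows, key):
    # Stand-in for a btree: the rows kept in sorted order under `key`.
    return sorted(rows, key=key)

def index_lookup(index, value, key):
    # Binary search, which is only valid if `index` is sorted under `key`.
    keys = [key(v) for v in index]
    target = key(value)
    i = bisect_left(keys, target)
    return i < len(keys) and keys[i] == target

rows = ["vb", "wa", "vc"]
index = build_index(rows, old_key)                   # built under the old ordering

found_with_old = index_lookup(index, "wa", old_key)  # same ordering as the build
found_with_new = index_lookup(index, "wa", new_key)  # ordering changed underneath
```

In this sketch the row "wa" is physically present in the index, yet the
lookup under the new ordering fails to find it. A uniqueness check built
on the same lookup would then accept a second "wa", which is the shape
of the missing-row and duplicate-key problems described below.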

Not to mention all the other places in the DB where collation is used...
partitioning, constraints, and any other place where persisted data can
make an assumption about any sort of string comparison.

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

-Jeremy Schneider

--
http://about.me/jeremy_schneider
