Re: Collation versioning

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Douglas Doole <dougdoole(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation versioning
Date: 2018-09-16 20:13:34
Message-ID: CAEepm=3-rxTzk3anR1QA=tuNrbAQ_ejJ2rj5Hp_y5hdJQr=rEw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 17, 2018 at 6:13 AM Douglas Doole <dougdoole(at)gmail(dot)com> wrote:
>
> On Sun, Sep 16, 2018 at 1:20 AM Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>
>> 3. Fix the tracking of when reindexes need to be rebuilt, so that you
>> can't get it wrong (as you're alluding to above).
>
>
> I've mentioned this in the past, but didn't seem to get any traction, so I'll try it again ;-)

Hi Doug,

Probably because we agree with you, but don't have all the answers :-)

> The focus on indexes when a collation changes is, in my opinion, the least of the problems. You definitely have to worry about indexes, but they can be easily rebuilt. What about other places where collation is hardened into the system, such as constraints?

We have to start somewhere and indexes are the first thing that people
notice, and are much likely to actually be a problem (personally I've
encountered many cases of index corruption due to collation changes in
the wild, but never a constraint corruption, though I fully understand
the theoretical concern). Several of us have observed specifically
that the same problems apply to CHECK constraints and PARTITION
boundaries, and there may be other things like that. You could
imagine tracking collation dependencies on those, requiring a RECHECK
or REPARTITION operation to update them after a depended-on collation
version changes.

Perhaps that suggests that there should be a more general way to store
collation dependencies -- something more like pg_depend, rather than
bolting something like indcollversion onto indexes and every other
kind of catalog that might need it. I don't know.

> For example, in ICU 4.6 the handling of accents changed for French. Previously accents were considered right-to-left but ICU 4.6 reversed this. So consider a constraint like CHECK COL < 'coté' (last letter is U+00E9, small letter e with acute). Prior to ICU 4.6 the value 'côte' (second letter is U+00F4, small letter o with circumflex) would have passed this constraint. With 4.6 or later it would be rejected because of the accent ordering change. As soon as the collation changes, this table becomes inconsistent and a reindex isn't going to help it. This becomes a data cleansing problem at this point (which sucks for the user because their data was clean immediately prior to the "upgrade").

Yeah, that's a fun case. I haven't checked recently, but last time I
looked[1] and if I understood that byte sequence correctly, they were
still using that whacky right-to-left accent sorting logic for fr_CA,
but had given up on it in fr_FR (though there was still a way to ask
for it). Vive le Québec libre.

> ...
>
> And constraints problems are even easier than triggers. Consider a database with complex BI rules that are implemented through triggers that fire when values are/are not equal. If the equality of strings change, there could be bad data throughout the tables. (At least with constraints the inter-column dependencies are explicit in the catalogs. With triggers anything goes.)

Once you get into downstream effects of changes (whether they are
recorded in the database or elsewhere), I think it's basically beyond
our event horizon. Why and when did the collation definition change
(bug fix in CLDR, decree by the Académie Française taking effect on 1
January 2019, ...)? We could all use bitemporal databases and
multi-version ICU, but at some point it all starts to look like an
episode of Dr Who. I think we should make a clear distinction between
things that invalidate the correct working of the database, and more
nebulous effects that we can't possibly track in general.

[1] https://www.postgresql.org/message-id/CAEepm%3D30SQpEUjau%3DdScuNeVZaK2kJ6QQDCHF75u5W%3DCz%3D3Scw%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Cramer 2018-09-16 20:42:11 Re: ssl tests README and certs
Previous Message Tom Lane 2018-09-16 20:13:00 Re: XMLNAMESPACES (was Re: Clarification of nodeToString() use cases)