Re: Collation versioning

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Douglas Doole <dougdoole(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>
Subject: Re: Collation versioning
Date: 2020-11-03 21:48:24
Message-ID: CA+hUKGKmcG6Khn7D59aDuUQkFLMvtN94SR2u_nBFMuRbPWWmXg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 3, 2020 at 4:38 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Tue, Nov 3, 2020 at 1:51 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > On Tue, 3 Nov 2020 at 12:29, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > > Running low on ideas for now, so thought I'd post this in case
> > > someone thinks of something else.
> >
> > FWIW, the attached does fix the issue for me. It basically just calls
> > the function that converts the windows-type "English_New Zealand.1252"
> > locale name string into, e.g. "en_NZ". Then, since GetNLSVersionEx()
> > wants yet another variant with a - rather than an _, I've just added a
> > couple of lines to swap the _ for a -. There's a bit of extra work
> > there since IsoLocaleName() just did the opposite, so perhaps doing it
> > that way was lazy of me. I'd have invented some other function if I
> > could have thought of a meaningful name for it, then just have the ISO
> > version of it swap - for _.
>
> Thanks! Hmm, it looks like Windows calls the hyphenated ISO
> language-country form a "tag". It makes me slightly nervous to ask
> for the version of a transformed name with the encoding stripped, but
> it does seem entirely plausible that it gives the answer we seek. I
> suppose if we were starting from a clean slate we might want to
> perform this transformation up front so that we have it in datcollate
> and then not have to think about the older form ever again. If we
> decided to do that going forward, the last trace of that problem would
> live in pg_upgrade. If we ever extend pg_import_system_collations()
> to cover Windows, we should make sure it captures the tag form.

So we have:

1. Windows locale names, like "English_United States.1252". Windows
still returns these from setlocale(), so they finish up in datcollate,
and yet some relevant APIs don't accept them, at least on some
machines.

2. BCP 47/RFC 5646 language tags, like "en-US". Windows uses these
in relevant new APIs, including the case in point.

3. Unix-style (XPG? ISO/IEC 15897?) locale names, like "en_US"
("language[_territory[.codeset]][@modifier]"). These are used for
message catalogues.
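
For simple language-country names, forms 2 and 3 differ only in the
separator (and in form 3's optional codeset and modifier), which is the
swap David's patch performs. A minimal C sketch of that step follows.
The helper names are hypothetical, not PostgreSQL functions, and it
makes no attempt to handle tags with script or variant subtags.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): copy a locale name into dst,
 * dropping any ".codeset" or "@modifier" suffix and replacing every
 * occurrence of 'from' with 'to'.  Returns 0 on success, or -1 if dst is
 * too small to hold the result.
 */
static int
convert_locale_separators(const char *src, char *dst, size_t dstlen,
                          char from, char to)
{
    size_t      len = strcspn(src, ".@");   /* stop before codeset/modifier */
    size_t      i;

    if (len + 1 > dstlen)
        return -1;
    for (i = 0; i < len; i++)
        dst[i] = (src[i] == from) ? to : src[i];
    dst[len] = '\0';
    return 0;
}

/* Form 3 ("en_NZ.UTF-8") to form 2 ("en-NZ"). */
static int
unix_name_to_tag(const char *name, char *tag, size_t taglen)
{
    return convert_locale_separators(name, tag, taglen, '_', '-');
}

/* Form 2 ("en-NZ") to form 3 without a codeset ("en_NZ"). */
static int
tag_to_unix_name(const char *tag, char *name, size_t namelen)
{
    return convert_locale_separators(tag, name, namelen, '-', '_');
}
```

Note that this only covers moving between forms 2 and 3; getting from
form 1 to either of them needs a lookup, not string surgery.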

We have a VS2015+ way of converting from form 1 to form 2 (and thence
3 by s/-/_/), and an older way. Unfortunately, the new way looks a
little too fuzzy: if I'm reading it right, search_locale_enum() might
stop on either "en" or "en-AU", given "English_Australia", depending
on the search order, no? This may be fine for the purpose of looking
up error messages with gettext() (where there is only one English
language message catalogue; we haven't got around to translating our
errors into 'strayan yet), but it doesn't seem like a good way to look
up the collation version; for all I know, "en" variants might change
independently (I doubt it in practice, but in theory it's wrong). We
want the same algorithm that Windows uses internally to resolve the
old style name to a collation; in other words we probably want
something more like the code path that they took away in VS2015 :-(.
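
To illustrate the search-order concern, here is a toy first-match
lookup in C. It is not the search_locale_enum() logic and not anything
Windows does; the table layout, the function names and the matching
rule (a language-only entry is accepted as a match) are all assumptions
made for the sake of the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; the names and tags are illustrative only. */
typedef struct
{
    const char *tag;        /* form 2, e.g. "en-AU" */
    const char *language;   /* e.g. "English" */
    const char *country;    /* e.g. "Australia", or NULL if language-only */
} toy_locale;

/*
 * Toy fuzzy search: return the tag of the first entry whose language
 * matches and whose country is either absent or matches.  The result
 * therefore depends on the order of the entries.
 */
static const char *
toy_first_match(const toy_locale *entries, size_t n,
                const char *language, const char *country)
{
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (strcmp(entries[i].language, language) != 0)
            continue;
        if (entries[i].country == NULL ||
            strcmp(entries[i].country, country) == 0)
            return entries[i].tag;
    }
    return NULL;
}

/*
 * Stricter variant: accept only an entry whose language and country both
 * match, so the answer no longer depends on enumeration order.
 */
static const char *
toy_exact_match(const toy_locale *entries, size_t n,
                const char *language, const char *country)
{
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (entries[i].country != NULL &&
            strcmp(entries[i].language, language) == 0 &&
            strcmp(entries[i].country, country) == 0)
            return entries[i].tag;
    }
    return NULL;
}
```

With the entries ordered "en" then "en-AU", toy_first_match() returns
"en" for "English"/"Australia"; with the order reversed it returns
"en-AU". toy_exact_match() returns "en-AU" either way, which is the
kind of determinism a collation version lookup would want.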
