Re: The dangers of streaming across versions of glibc: A cautionary tale

From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Matthew Kelly <mkelly(at)tripadvisor(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 01:12:53
Message-ID: CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYVbwoU79=TuP8HE3A@mail.gmail.com
Lists: pgsql-general

On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> No surprise; I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely. We really have no way of testing
> for breakage either. :-(

I guess that TripAdvisor were using some particular collation whose
rules were liable to change. Sorting rules for English text (so, say,
en_US.UTF-8) are highly unlikely to change. That might be much less
true for other locales.

Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

So, the reality is that we only have ourselves to blame. :-(

LC_IDENTIFICATION serves this versioning purpose on glibc. Here is what
en_US looks like on my machine:

"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper(at)redhat(dot)com>, 2000

LC_IDENTIFICATION
title "English locale for the USA"
source "Free Software Foundation, Inc."
address "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact ""
email "bug-glibc-locales(at)gnu(dot)org"
tel ""
fax ""
language "English"
territory "USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_IDENTIFICATION
category "en_US:2000";LC_CTYPE
category "en_US:2000";LC_COLLATE
category "en_US:2000";LC_TIME
category "en_US:2000";LC_NUMERIC
category "en_US:2000";LC_MONETARY
category "en_US:2000";LC_MESSAGES
category "en_US:2000";LC_PAPER
category "en_US:2000";LC_NAME
category "en_US:2000";LC_ADDRESS
category "en_US:2000";LC_TELEPHONE
*** SNIP ***
"""

This is a GNU extension [1]. If the OS ships a new version of a
collation, things probably keep working by accident a lot of the time,
because the collation rule added or removed was fairly esoteric anyway;
such is the nature of these things. If it were something that came up a
lot, it would surely have been settled by standardization years ago.
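
To make the versioning fields concrete, here is a minimal Python sketch
that pulls the revision, date and per-category tags out of a locale
definition like the excerpt quoted above. The function name
parse_lc_identification and the trimmed SAMPLE text are hypothetical,
introduced only for illustration; this is not how PostgreSQL or glibc
reads these fields.

```python
import re


def parse_lc_identification(text, comment_char="%"):
    """Extract keyword/value pairs from an LC_IDENTIFICATION section.

    Hypothetical helper: it understands only the simple
    `keyword "value"` and `category "tag";LC_XXX` lines shown in the
    excerpt, not the full locale definition syntax.
    """
    fields = {"category": {}}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_char):
            continue  # skip blank lines and comments
        if line == "LC_IDENTIFICATION":
            in_section = True
            continue
        if line == "END LC_IDENTIFICATION":
            break
        if not in_section:
            continue
        cat = re.match(r'category\s+"([^"]*)";(\w+)', line)
        if cat:
            # e.g. category "en_US:2000";LC_COLLATE
            fields["category"][cat.group(2)] = cat.group(1)
            continue
        kv = re.match(r'(\w+)\s+"([^"]*)"', line)
        if kv:
            fields[kv.group(1)] = kv.group(2)
    return fields


# Trimmed copy of the excerpt above (assumed input, not a full locale file).
SAMPLE = '''
escape_char /
comment_char %
% Locale for English locale in the USA

LC_IDENTIFICATION
title "English locale for the USA"
language "English"
territory "USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_COLLATE
END LC_IDENTIFICATION
'''

info = parse_lc_identification(SAMPLE)
print(info["revision"], info["date"], info["category"]["LC_COLLATE"])
```

Recording such values when an index is built, and comparing them later,
would only help to the extent that the locale maintainers actually bump
them when collation rules change.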

If OS vendors are not going to give us a standard API for versioning,
we're hosed. For about 2 minutes I thought about suggesting that we
hash a strxfrm() blob, before realizing that that's a stupid idea.
Glibc would be a good place to start.
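
For reference, a rough Python sketch of what a strxfrm()-based
fingerprint might look like. The function name collation_fingerprint
and the sample strings are hypothetical. Note that it hashes the sort
order that locale.strxfrm produces for a few strings rather than the
raw transformed blob, so it is a variation on the idea, not the idea
itself. It defaults to the "C" locale only because that locale is
always available.

```python
import hashlib
import locale


def collation_fingerprint(samples, locale_name="C"):
    """Hash the order in which `samples` sort under `locale_name`.

    Hypothetical helper for illustration only.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        ordered = sorted(samples, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore it
    joined = "\x00".join(ordered).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


print(collation_fingerprint(["b", "a", "B"]))
```

A fingerprint like this reflects only the strings that were sampled, so
a rule change affecting any other string goes undetected.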

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html
--
Regards,
Peter Geoghegan
