Re: Dealing with collation and strcoll/strxfrm/etc

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dealing with collation and strcoll/strxfrm/etc
Date: 2016-03-28 19:36:54
Message-ID: 20160328193654.GW3127@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

* Peter Geoghegan (pg(at)heroku(dot)com) wrote:
> On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > If we're going to talk about minimum requirements, I'd like to argue
> > that we require whatever system we're using to have versioning (which
> > glibc currently lacks, as I understand it...) to avoid the risk that
> > indexes will become corrupt when whatever we're using for collation
> > changes. I'm pretty sure that's already bitten us on at least some
> > RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> > with strcoll vs. strxfrm.
>
> I totally agree that anything we should adopt should support
> versioning. Glibc does have a non-standard versioning scheme, but we
> don't use it. Other stdlibs may do versioning another way, or not at
> all. A world in which ICU is the defacto standard for Postgres (i.e.
> the actual standard on all major platforms), we mostly just have one
> thing to target, which seems like something to aim for.

Having to figure out how each and every stdlib does versioning doesn't
sound fun, I certainly agree with you there, but it hardly seems
impossible. What we need, even if we look to move to ICU, is a place to
remember that version information and a way to do something when we
discover that we're now using a different version.

I'm not quite sure what the best way to do that is, but I imagine it
involves changes to existing catalogs or perhaps even a new one. I
don't have any particularly great ideas for existing releases (maybe
stash information in the index somewhere when it's rebuilt and then
check it and throw an ERROR if they don't match?)

> The question is only how we deal with this when it happens. One thing
> that's attractive about ICU is that it makes this explicit, both for
> the logical behavior of a collation, as well as the stability of
> binary sort keys (Glibc's versioning seemingly just does the former).
> So the equivalent of strxfrm() output has license to change for
> technical reasons that are orthogonal to the practical concerns of
> end-users about how text sorts in their locale. ICU is clear on what
> it takes to make binary sort keys in indexes work. And various major
> database systems rely on this being right.

There seems to be some disagreement about if ICU provides the
information we'd need to make a decision or not. It seems like it
would, given its usage in other database systems, but if so, we need to
very clearly understand exactly how it works and how we can depend on
it.

> > Regarding key abbreviation and performance, if we are confident that
> > strcoll and strxfrm are at least independently internally consistent
> > then we could consider offering an option to choose between them.
>
> I think they just need to match, per the standard. After all,
> abbreviation will sometimes require strcoll() tie-breakers.

Ok, I didn't see that in the man-pages. If that's the case then it
seems like there isn't much hope of just using strxfrm().

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2016-03-28 20:02:44 SQL access to access method details
Previous Message Fabrízio de Royes Mello 2016-03-28 19:11:29 Re: Sequence Access Method WIP