Dealing with collation and strcoll/strxfrm/etc

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: obartunov(at)gmail(dot)com, Peter Geoghegan <pg(at)heroku(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Dealing with collation and strcoll/strxfrm/etc
Date: 2016-03-28 14:57:04
Message-ID: 20160328145704.GP3127@tamriel.snowman.net
Lists: pgsql-hackers

All,

Changed the thread name (we're no longer talking about release
notes...).

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Oleg Bartunov <obartunov(at)gmail(dot)com> writes:
> > Should we start thinking about ICU ?
>
> Isn't it still true that ICU fails to meet our minimum requirements?
> That would include (a) working with the full Unicode character range
> (not only UTF16) and (b) working with non-Unicode encodings. No doubt
> we could deal with (b) by inserting a conversion, but that would take
> a lot of shine off the performance numbers you mention.
>
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

We have a wiki page about ICU. I'm not sure that it's current, but if
it isn't and people are interested then perhaps we should update it:

https://wiki.postgresql.org/wiki/Todo:ICU

If we're going to talk about minimum requirements, I'd like to argue
that we require whatever collation provider we use to expose a version
(which glibc currently lacks, as I understand it...), so that we can
avoid the risk of indexes becoming corrupt when the collation rules
change underneath us. I'm pretty sure that's already bitten us on at
least some RHEL6 -> RHEL7 migrations in some locales, even forgetting
the issues with strcoll vs. strxfrm.
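
To make the versioning idea concrete, here is a rough sketch in C of the
kind of check I have in mind. Everything here is hypothetical: the
struct, the function name, and the assumption that the provider can
hand us a version string at all (which, as noted, glibc does not do for
its collation data today). It is not how PG stores index metadata.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical per-index record of what the index was built with.
 * Neither the struct nor its fields exist in PostgreSQL; they are
 * assumptions for illustration only.
 */
struct index_collation_info
{
    const char *provider;   /* e.g. "libc" or "icu" (assumed labels) */
    const char *version;    /* provider-reported version, or NULL if unknown */
};

/*
 * Returns 1 if the index should be treated as suspect (rebuild or at
 * least warn), 0 if the recorded provider and version both match what
 * is running now. An unknown version on either side counts as a
 * mismatch, since we cannot prove the sort order is unchanged.
 */
static int
index_needs_rebuild(const struct index_collation_info *stored,
                    const char *current_provider,
                    const char *current_version)
{
    if (stored == NULL || stored->provider == NULL || stored->version == NULL)
        return 1;
    if (current_provider == NULL || current_version == NULL)
        return 1;
    if (strcmp(stored->provider, current_provider) != 0)
        return 1;
    return strcmp(stored->version, current_version) != 0;
}
```

The point of treating "unknown" as a mismatch is that a provider without
versioning gives us no way to tell a safe upgrade from an unsafe one.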

Regarding key abbreviation and performance, if we are confident that
strcoll and strxfrm are at least independently internally consistent
then we could consider offering an option to choose between them.
We'd need to identify what each index was built with to do so, however,
as they would need to be rebuilt if the choice changes, at least
until/unless they're made to reliably agree. Even using only one or the
other doesn't address the versioning problem though, which is a problem
for all currently released versions of PG and is just going to continue
to be an issue.
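
For anyone who wants to check whether strcoll and strxfrm agree on a
given platform and locale, something along these lines could be a
starting point. This is only a sketch: the helper names are made up
here, it uses nothing beyond the standard C library, and agreeing on a
handful of string pairs obviously doesn't prove agreement in general.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1. */
static int
sign(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Compare two strings the way an abbreviated-key sort would: transform
 * both with strxfrm() and strcmp() the results. Returns -1, 0, or 1,
 * or 2 if memory allocation fails.
 */
static int
cmp_via_strxfrm(const char *a, const char *b)
{
    /* With a zero-length buffer, strxfrm() only reports the size needed. */
    size_t  la = strxfrm(NULL, a, 0) + 1;
    size_t  lb = strxfrm(NULL, b, 0) + 1;
    char   *ta = malloc(la);
    char   *tb = malloc(lb);
    int     result = 2;

    if (ta != NULL && tb != NULL)
    {
        strxfrm(ta, a, la);
        strxfrm(tb, b, lb);
        result = sign(strcmp(ta, tb));
    }
    free(ta);
    free(tb);
    return result;
}

/*
 * Returns 1 if strcoll() and the strxfrm()-based comparison order this
 * pair the same way under the current LC_COLLATE setting, else 0.
 */
static int
collation_agrees(const char *a, const char *b)
{
    return sign(strcoll(a, b)) == cmp_via_strxfrm(a, b);
}
```

To use it, call setlocale(LC_COLLATE, ...) with the locale of interest
and then run collation_agrees() over a corpus of string pairs; any pair
returning 0 is a case where an index built with one function could
disagree with comparisons done by the other.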

Thanks!

Stephen
