Re: sortsupport for text

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: sortsupport for text
Date: 2012-06-19 18:44:35
Message-ID: CAEYLb_V1uGn_pWMStNSi=JUnGOo3JpiPwjbzmHAaWfYTqisNUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19 June 2012 18:57, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> We weren't using en_US.UTF-8 collation (or any other "proper"
> collation) on Sybase -- I'm not sure whether they even supported
> proper collation sequences on the versions we used.  I'm thinking of
> when we were using their "case insensitive" sorting.  I don't know
> the implementation details, but the behavior was consistent with
> including each character-based column twice: once in the requested
> position in the ORDER BY clause but folded to a consistent case, and
> again after all the columns in the ORDER BY clause in original form,
> with C collation.
>
> I wasn't aware that en_US.UTF-8 doesn't have equivalence without
> equality.

Not that many do. The underlying cause of the problem back in 2005 was
the tacit assumption that none do, which presumably we got away with
for a while. I mentioned Hungarian, but it happens a bit in Swedish
too.

PostgreSQL supported Unicode before 2005, when the tie-breaker was
introduced. I know at least one Swede who used Postgres95. I just took
a look at the REL6_4 branch, and it looks much the same in 1999 as it
did in 2005, in that there is no tie-breaker after the strcoll(). Now,
that being the case, and Hungarian in particular having a whole bunch
of these equivalencies, I have to wonder if the original complainant's
problem really was diagnosed correctly. It could of had something to
do with the fact that texteq() was confused about whether it reported
equality or equivalency - it may have taken that long for the (len1 !=
len2) fastpath thing (only holds for equality, not equivalence,
despite the fact that the 2005-era strcoll() call checks equivalence
within texteq() ) to trip someone out, because texteq() would have
thereby given inconsistent answers in a very subtle way, that were not
correct either according to the Hungarian locale, nor according to
simple bitwise equality. That's mostly speculation, but the question
must be asked.

> I guess that surprising result in my last post is just
> plain inevitable with that collation then.  Bummer.  Is there
> actually anyone who finds that to be a useful behavior?

Looking at the details again and assuming a US locale, yeah, it is.
The substance of your complain holds though.

> For a collation which considered upper-case and lower-case to be
> equivalent, would PostgreSQL sort as I wanted, or is it doing some
> tie-break per column within equivalent values?

You could do that, and some people do use custom collations for
various reasons. That's obviously very much of minority interest
though. Most people will just use citext or something. However, since
citext is itself a client of varstr_cmp(), this won't help you.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-06-19 18:44:40 Re: Event Triggers reduced, v1
Previous Message Andres Freund 2012-06-19 18:23:20 Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture