Re: sortsupport for text

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: sortsupport for text
Date: 2012-06-20 10:27:50
Message-ID: CAEYLb_U5emaPtH+hPwC9q+XFkSegYg3hJS7sriv3mSEbRd7ieA@mail.gmail.com
Lists: pgsql-hackers

On 20 June 2012 11:00, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Sun, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
>> So if you take the word "Aßlar" here - that is equivalent to "Asslar",
>> and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
>> LC_COLLATE
>
> This is not actually correct.  glibc will sort Asslar before Aßlar, and
> that is correct in my mind.

Uh, what happened here was that I assumed that it was correct, and
then went to verify it before sending the mail and found that it
wasn't. Since I couldn't immediately find any hard data about which
characters this did apply to, I decided to turn it into a joke. I say
this, and yet you've included that bit of the e-mail inline in your
reply, so maybe it just wasn't a very good joke.

> When a Wikipedia page on some particular language's alphabet says
> something like "$letterA and $letterB are equivalent", what it really
> means is that they are sorted the same compared to other letters, but
> are distinct when ties are broken.

I know.

>>  (if you tried this out for yourself and found that I was
>> actually lying through my teeth, pretend I said Hungarian instead of
>> German and "some really obscure character" rather than ß).
>
> Yeah, there are obviously exceptions, which led to the original change
> being made, but they are not as wide-spread as they appear to be.

True.

> The real issue in this area, I suspect, will be dealing with Unicode
> combining sequences versus equivalent precombined characters.  But
> support for that is generally crappy, so it's not urgent to deal with
> it.

I agree that it isn't urgent. However, I have an ulterior motive,
which is that in allowing for this, we remove the need to strcmp()
after each strcoll(), and consequently it becomes possible to use
strxfrm() instead. Now, we could also use a hack that's going to make
the strxfrm() blobs even bulkier still (basically, concatenate the
original text to the blob before strcmp()), but I don't want to go
there if it can possibly be avoided.
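
To make the trade-off concrete, here is a minimal sketch in plain C of
the two comparison strategies. It is not the PostgreSQL code, and the
function names (cmp_strcoll_tiebreak, make_sort_key, cmp_sort_keys) are
hypothetical, chosen only for illustration. It assumes NUL-terminated
strings and whatever LC_COLLATE the process has set.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Strategy 1: strcoll(), with strcmp() as a tie-breaker, so that two
 * strings compare as equal only if they are bytewise identical.
 */
static int
cmp_strcoll_tiebreak(const char *a, const char *b)
{
    int result = strcoll(a, b);

    if (result == 0)
        result = strcmp(a, b);
    return result;
}

/*
 * Strategy 2, step 1: transform a string once with strxfrm().
 * Returns a malloc'd, NUL-terminated blob, or NULL on allocation
 * failure.  The caller must free() it.
 */
static char *
make_sort_key(const char *s)
{
    /* With n == 0, strxfrm() only reports the length it needs. */
    size_t len = strxfrm(NULL, s, 0);
    char *key = malloc(len + 1);

    if (key != NULL)
        strxfrm(key, s, len + 1);
    return key;
}

/*
 * Strategy 2, step 2: compare two blobs with a plain strcmp().
 * There is no tie-breaker here, so strings that the locale considers
 * equal yield 0 even if their original bytes differ.
 */
static int
cmp_sort_keys(const char *key_a, const char *key_b)
{
    return strcmp(key_a, key_b);
}
```

A sort would call make_sort_key() once per datum and cmp_sort_keys()
once per comparison. The missing tie-breaker in the second strategy is
exactly the behavioural difference under discussion; the hack mentioned
above would restore it at the cost of larger blobs.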

I should also point out that we pride ourselves on following the
letter of the standard when that makes sense, and we are currently not
doing that in respect of the Unicode standard.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
