Re: Losing my latin on Ordering...

From: Dominique Devienne <ddevienne(at)gmail(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Losing my latin on Ordering...
Date: 2023-02-14 11:17:47
Message-ID: CAFCRh-_K09U+Zr0YY-nV7ts++WZ-XAR2kG3osNC03t6_LwA-xA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
wrote:

> On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > Hi. Porting a unit test to PostgreSQL, we got a failure related to
> ordering.
> >
> > We've distilled it to the below. The DB is en_US.UTF-8, and the sorting
> we get
> > does not make sense to me. The same prefix can be sorted differently
> based on
> > the suffix apprently, which doesn't make any sense to me.
> >
> > Surely sorting should be "constant left-to-right", no? What are we
> missing?
>
> No, it isn't. That's not how natural language collations work.
>

Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a
given collation are?

I'm aware of sorting taking numerical numbers in text influencing sort, so
"Foo10" comes after "Foo9",
but that's not what we are discussing here. "Foo*" and "Foo " have no
logical relatioship, like 9 and 10 do.

> > I'm already surprised (star) comes before (space), when the latter "comes
> > before" the former in both ASCII and UTF-8, but that the two "Foo*" and
> "Foo "
> > prefixed pairs are not clustered after sorting is just mistifying to me.
> So how come?
>
> Because they compare identical on the first three levels. Any difference
> in
> letters, accents or case weighs stronger, even if it occurs to the right
> of these substrings.
>

That's completely unintuitive...

> > For now we can work-around this by explicitly adding the `collate "C"` on
> > the queries underlying that particular test, but that would be wrong in
> the
> > general case of international strings to sort, so I'd really like to
> understand
> > what's going on.
>
> Yes, it soulds like the "C" collation may be best for you. That is, if
> you don't
> mind that "Z" < "a".
>

I would mind if I asked for case-insensitive comparisons.

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2023-02-14 11:35:16 Re: Losing my latin on Ordering...
Previous Message Amit Kapila 2023-02-14 11:02:13 Re: Support logical replication of DDLs