Quick Links

Re: Losing my latin on Ordering...

From:	Dominique Devienne <ddevienne(at)gmail(dot)com>
To:	Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Losing my latin on Ordering...
Date:	2023-02-14 11:17:47
Message-ID:	CAFCRh-_K09U+Zr0YY-nV7ts++WZ-XAR2kG3osNC03t6_LwA-xA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
wrote:

> On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > Hi. Porting a unit test to PostgreSQL, we got a failure related to
> ordering.
> >
> > We've distilled it to the below. The DB is en_US.UTF-8, and the sorting
> we get
> > does not make sense to me. The same prefix can be sorted differently
> based on
> > the suffix apprently, which doesn't make any sense to me.
> >
> > Surely sorting should be "constant left-to-right", no? What are we
> missing?
>
> No, it isn't. That's not how natural language collations work.
>

Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a
given collation are?

I'm aware of sorting taking numerical numbers in text influencing sort, so
"Foo10" comes after "Foo9",
but that's not what we are discussing here. "Foo*" and "Foo " have no
logical relatioship, like 9 and 10 do.

> > I'm already surprised (star) comes before (space), when the latter "comes
> > before" the former in both ASCII and UTF-8, but that the two "Foo*" and
> "Foo "
> > prefixed pairs are not clustered after sorting is just mistifying to me.
> So how come?
>
> Because they compare identical on the first three levels. Any difference
> in
> letters, accents or case weighs stronger, even if it occurs to the right
> of these substrings.
>

That's completely unintuitive...

> > For now we can work-around this by explicitly adding the `collate "C"` on
> > the queries underlying that particular test, but that would be wrong in
> the
> > general case of international strings to sort, so I'd really like to
> understand
> > what's going on.
>
> Yes, it soulds like the "C" collation may be best for you. That is, if
> you don't
> mind that "Z" < "a".
>

I would mind if I asked for case-insensitive comparisons.

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

In response to

Re: Losing my latin on Ordering... at 2023-02-14 10:23:19 from Laurenz Albe

Responses

Re: Losing my latin on Ordering... at 2023-02-14 11:35:16 from Alvaro Herrera
Re: Losing my latin on Ordering... at 2023-02-14 11:40:46 from Laurenz Albe

Browse pgsql-general by date

	From	Date	Subject
Next Message	Alvaro Herrera	2023-02-14 11:35:16	Re: Losing my latin on Ordering...
Previous Message	Amit Kapila	2023-02-14 11:02:13	Re: Support logical replication of DDLs