From: | Nishant Sharma <nishant(dot)sharma(at)enterprisedb(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem in 'ORDER BY' of a column using a created collation? |
Date: | 2025-10-01 10:02:33 |
Message-ID: | CADrsxdbBn9NShy41AXUi2NUWUV48arffMcQbpzy8c-=e0Q1HSw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Sep 26, 2025 at 6:46 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Generally, there are two methods for performing collation-aware string
> comparisons, and they are expected to produce equivalent results. For
> libc collations, the first method is to use strcoll() or strcoll_l()
> or similar to directly compared the strings, and the other is to use
> strxfrm() or similar to convert the string to a normalize
> representation which can then be compared using memcmp(). Equivalent
> facilities exist for ICU; see collate_methods_icu and
> collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those
> should be getting used here.
>
> Even though these two methods are expected to produce equivalent
> results, if there's a bug, they might not. That bug could either exist
> in our code or it could exist in ICU; I suspect the latter is more
> likely, but the former is certainly possible. What I think you want to
> do is try to track down two specific strings where transforming them
> via strnxfrm_icu() produces results that compare in one order using
> memcmp(), but passing the same strings to strncoll_icu() or
> strncoll_icu_utf8() -- whichever is appropriate -- produces a
> different result. If you're able to find such strings, then you can
> probably rephrase those examples in terms of whatever underlying
> functions strncoll_icu() and strxfrm_icu() are calling and we can
> maybe report the problem to the ICU maintainers. If no such strings
> seem to exist, then there's probably a problem in the PostgreSQL code
> someplace, and you should try to figure out why we're getting the
> wrong answer despite ICU apparently doing the right thing.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
Thanks Robert for your reply!
I was able to create an independent C src file. With no relation to PG in
it.
Did a simple comparison of the two methods mentioned by Robert for '1'
& 'a' as input strings. The ASCENDING SORT ORDER of them is printed.
Here is how I compiled the C src code:-
"gcc icu_issue_repro.c -L/usr/lib -licui18n -licuuc -licudata -o
icu_issue_repro.o"
Here is its output:-
------------------------------------------------------------------------------------------
Testing sort order for '1' & 'a' using ICU library with collation =
'ja-u-kr-latn-digit'
With Method ucol_strcollUTF8():
SORT ORDER ASC : '1', 'a'
With Method ucol_nextSortKeyPart() (i.e transform and memcmp):
SORT ORDER ASC : 'a', '1'
Testing Ends
------------------------------------------------------------------------------------------
We can clearly see that both methods are showing different sort order
for the same input strings.
So, I think this helps confirm that the problem lies in the ICU library
method "ucol_strcollUTF8()".
Robert, if you agree with my shared details, can you please help me
with details on how I can share this issue to ICU maintainers for the
correction?
Thank you!
PFA, the C src code.
Regards,
Nishant Sharma,
EDB, Pune.
Attachment | Content-Type | Size |
---|---|---|
icu_issue_repro.c | application/octet-stream | 3.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Arseniy Mukhin | 2025-10-01 10:08:33 | Re: The ability of postgres to determine loss of files of the main fork |
Previous Message | Richard Guo | 2025-10-01 09:29:54 | Re: Fix incorrect function reference BufFileOpenShared in comment. |