From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Nishant Sharma <nishant(dot)sharma(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem in 'ORDER BY' of a column using a created collation? |
Date: | 2025-09-26 13:15:51 |
Message-ID: | CA+TgmobWgZC-JkS4OKbxQsjuBh7mpKMPw5kxtcHJgN-7+vGgCg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Aug 25, 2025 at 3:52 AM Nishant Sharma
<nishant(dot)sharma(at)enterprisedb(dot)com> wrote:
> We did debug 'Experiment 1' and we find that:-
> Whatever is the Column 1 in 'ORDER BY' gets correctly ordered,
> because it uses abbreviated sort optimization due to which its data
> datum gets converted to abbreviated datum using
> "varstr_abbrev_convert()" function, and then the comparator
> function selected is
> "ssup->comparator = ssup_datum_unsigned_cmp()"
> for sorting operation. But in case of column 2 in 'ORDER BY' (which
> is showing incorrect result for 'Experiment 1') does not use
> abbreviated sort optimization and here comparator function selected
> is "ssup->comparator = varlenafastcmp_locale" -->
> "strncoll_icu_utf8()", which appears, uses the third party ICU
> library function for comparison and does not work as expected.
Generally, there are two methods for performing collation-aware string
comparisons, and they are expected to produce equivalent results. For
libc collations, the first method is to use strcoll() or strcoll_l()
or similar to directly compared the strings, and the other is to use
strxfrm() or similar to convert the string to a normalize
representation which can then be compared using memcmp(). Equivalent
facilities exist for ICU; see collate_methods_icu and
collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those
should be getting used here.
Even though these two methods are expected to produce equivalent
results, if there's a bug, they might not. That bug could either exist
in our code or it could exist in ICU; I suspect the latter is more
likely, but the former is certainly possible. What I think you want to
do is try to track down two specific strings where transforming them
via strnxfrm_icu() produces results that compare in one order using
memcmp(), but passing the same strings to strncoll_icu() or
strncoll_icu_utf8() -- whichever is appropriate -- produces a
different result. If you're able to find such strings, then you can
probably rephrase those examples in terms of whatever underlying
functions strncoll_icu() and strxfrm_icu() are calling and we can
maybe report the problem to the ICU maintainers. If no such strings
seem to exist, then there's probably a problem in the PostgreSQL code
someplace, and you should try to figure out why we're getting the
wrong answer despite ICU apparently doing the right thing.
--
Robert Haas
EDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Álvaro Herrera | 2025-09-26 13:22:48 | Re: refactor backend type lists |
Previous Message | Ashutosh Bapat | 2025-09-26 12:44:28 | Re: Report bytes and transactions actually sent downtream |