From: | Markus Winand <markus(dot)winand(at)winand(dot)at> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: SQL/JSON path: collation for comparisons, minor typos in docs |
Date: | 2019-08-08 08:53:20 |
Message-ID: | A6A0BD39-E43F-4790-AE4C-338C7CBB0291@winand.at |
Lists: | pgsql-hackers |
Hi!
The patch makes my tests pass.
I wonder about a few things:
- Isn’t there any code that could be re-used for that (the code triggered by 'a' < 'A' COLLATE ucs_basic)?
- For object key members, the standard also refers to Unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.
My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8" I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(
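To illustrate what such a case-sensitivity check looks for, here is a minimal Python sketch of code point ordering. The helper name `codepoint_less` is made up for this example and is not part of PostgreSQL; it only mimics what a comparison under Unicode code point collation would decide.

```python
# Hypothetical helper (illustration only, not PostgreSQL code):
# compare two strings by their sequences of Unicode code points.
def codepoint_less(a: str, b: str) -> bool:
    return [ord(c) for c in a] < [ord(c) for c in b]

# Under code point collation, 'a' (U+0061) sorts after 'A' (U+0041),
# so 'a' < 'A' is false, while a linguistic collation may say otherwise.
a_before_A = codepoint_less("a", "A")
A_before_a = codepoint_less("A", "a")

# Sorting by code points puts all uppercase ASCII letters before lowercase ones.
ordered = sorted(["b", "A", "a", "B"], key=lambda s: [ord(c) for c in s])

print(a_before_A, A_before_a, ordered)
```

A comparison that returns true for 'a' < 'A' is therefore a hint that a locale-aware collation is in use instead of code point order.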
-markus
ps.: for me, testing the regular expression dialect of like_regex is out of scope
> On 8 Aug 2019, at 02:27, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
>> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>> On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
>>> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>>> On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus(dot)winand(at)winand(dot)at> wrote:
>>>>> I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.
>>>>>
>>>>> The functionality in question is the comparison operators other than ==. They use the database default collation rather than the standard-mandated "Unicode codepoint collation" (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).
>>>>
>>>> Thank you for pointing this out! Nikita is about to write a patch fixing that.
>>>
>>> Please, see the attached patch.
>>>
>>> Our idea is to not sacrifice "==" operator performance for standard
>>> conformance. So, "==" remains a per-byte comparison. For consistency,
>>> in the other operators we compare code points first, then do a per-byte
>>> comparison. In some edge cases, when the same Unicode code points have
>>> different binary representations in the database encoding, this behavior
>>> diverges from the standard. In the future we can implement strict standard
>>> conformance by normalizing input JSON strings.
>>
>> The previous version of the patch has a buggy implementation of
>> compareStrings(). A revised version is attached.
>
> Nikita pointed out to me that for UTF-8 strings the per-byte comparison
> result matches the code point comparison result. That allows simplifying
> the patch a lot.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
> <0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patch>
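
As an aside on the UTF-8 observation quoted above: the claim that per-byte comparison of UTF-8 strings gives the same result as code point comparison can be checked with a small Python sketch. The function names and the sample strings are assumptions chosen for this example, not anything taken from the patch. UTF-16 is included only as a contrast, because there the byte order of surrogate pairs no longer follows code point order.

```python
# Illustration only: check that byte-wise order of UTF-8 encodings
# agrees with code point order, and that UTF-16 does not always agree.

def codepoints(s: str) -> list:
    """Return the string as a list of Unicode code points."""
    return [ord(c) for c in s]

def byte_order_matches(a: str, b: str, encoding: str) -> bool:
    """True if comparing encoded bytes gives the same '<' result as code points."""
    return (a.encode(encoding) < b.encode(encoding)) == (codepoints(a) < codepoints(b))

# Sample strings covering 1-, 2-, 3- and 4-byte UTF-8 sequences (arbitrary choice).
samples = ["", "A", "a", "ab", "\u00e9", "z", "\uff5e", "\U00010000", "\U0001f600"]

utf8_all_match = all(
    byte_order_matches(a, b, "utf-8") for a in samples for b in samples
)

# U+FF5E is below U+10000 by code point, but U+10000 is encoded in UTF-16
# as the surrogate pair D800 DC00, whose bytes sort before FF5E.
utf16_pair_matches = byte_order_matches("\uff5e", "\U00010000", "utf-16-be")

print(utf8_all_match, utf16_pair_matches)
```

For these samples the UTF-8 check holds for every pair, while the UTF-16 pair disagrees, which is why the simplification applies to UTF-8 specifically.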