Re: Refactor to introduce pg_strcoll().

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Refactor to introduce pg_strcoll().
Date: 2022-11-10 02:23:50
Message-ID: 8e521c93de49c6a944636aa2ae4ad7786f74e786.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached some new patches. I think you were right that the API of
pg_strcoll() was strange before, so I changed it to:

* pg_strcoll() takes nul-terminated arguments
* pg_strncoll() takes arguments and lengths

The new pg_strcoll()/pg_strncoll() api in 0001 seems much reasonable to
support in the long term, potentially with other callers.

Patch 0004 exports:

* pg_collate_icu() is for ICU and takes arguments and lengths
* pg_collate_libc() is for libc and takes nul-terminated arguments

for a tiny speedup because varstrfastcmp_locale() has both nul-
terminated arguments and their lengths.

On Fri, 2022-11-04 at 15:06 -0700, Jeff Davis wrote:
> But I think the complex hacks are just the transformation into UTF 16
> and calling of wcscoll(). If that's done from within pg_strcoll()
> (after my patch), then I think it just works, right?

I included a patch (0005) to enable varstrfastcmp_locale on windows
with a server encoding of UTF-8. I don't have a great way of testing
this, but it seems like it should work.

> I was also considering an optimization to use stack allocation for
> short strings when doing the non-UTF8 icu comparison. I am seeing
> some
> benefit there

I also included this optimization in 0003: try to use the stack for
reasonable values; allocate on the heap for larger strings. I think
it's helping a bit.

Patch 0002 helps a lot more: for the case of ICU with a non-UTF8 server
encoding, the speedup is something like 30%. The reason is that
icu_to_uchar() currently calls ucnv_toUChars() twice: once to determine
the precise output buffer size required, and then again once it has the
buffer ready. But it's easy to just size the destination buffer
conservatively, because the maximum space required is enough to hold
twice the number of UChars as there are input characters[1], plus the
terminating nul. That means we just call ucnv_toUChars() once, to do
the actual conversion.

My perf test was to use a quick hack to disable abbreviated keys (so
that it's doing more comparisons), and then just do a large ORDER BY
... COLLATE on a table with generated text. The text is random lorem
ipsum data, with some characters replaced by multibyte characters to
exercise some more interesting paths. If someone else has some more
sophisticated collation tests I'd be interested to try them.

Regards,
Jeff Davis

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucnv_8h.html#ae1049fcb893783c860fe0f9d4da84939

Attachment Content-Type Size
v3-0001-Introduce-pg_strcoll-and-pg_strncoll.patch text/x-patch 13.5 KB
v3-0002-Refactor-icu_to_uchar.patch text/x-patch 5.6 KB
v3-0003-Optimize-buffer-allocation-for-pg_strcoll-pg_strn.patch text/x-patch 3.6 KB
v3-0004-Export-pg_collate_libc-and-pg_collate_icu.patch text/x-patch 4.9 KB
v3-0005-Allow-windows-to-use-varlenafastcmp_locale.patch text/x-patch 1.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-11-10 02:35:12 Re: HOT chain validation in verify_heapam()
Previous Message Peter Geoghegan 2022-11-10 02:13:12 Re: HOT chain validation in verify_heapam()