Re: Collation version tracking for macOS

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Joe Conway <mail(at)joeconway(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Nasby, Jim" <nasbyj(at)amazon(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation version tracking for macOS
Date: 2022-12-08 05:56:38
Message-ID: c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2022-12-06 at 10:33 +1300, Thomas Munro wrote:
> OK, I'm going to see what happens if I try to wrangle that stuff into
> a new catalogue table.

I've been hacking on a major refactor of the locale-related code. I
attached my progress and I think several patches are ready.

The main motivation is that I was frustrated by the special cases
everywhere. I wanted to make it easier to hack on this code going
forward, now that we are adding even more complexity for multiple ICU
libraries.

I'm posting to this thread (rather than my previous refactoring
thread[1]) because a lot of the people interested in working on this
code are here. So, if you like (or don't like) the structure of these
changes, please let me know.

Changes:
* Introduce pg_locale_internal.h to hide all USE_ICU code,
including all callers of the ICU routines. The files that
still need to include pg_locale_internal.h are:
- pg_locale.c
- regc_pg_locale.c
- formatting.c
- like.c
- like_support.c
- collationcmds.c
* Other callers (in files that don't include 
pg_locale_internal.h) don't need to branch based on the
provider, platform, database encoding, USE_ICU,
HAVE_LOCALE_T, etc.
* ICU and libc are treated the same way in more places.
* I made it so pg_locale_t is constructed first, then
moved to TopMemoryContext, so that it won't leak in
TopMemoryContext if errors are encountered.
* Introduce pg_strcoll, pg_strncoll, pg_strxfrm, and pg_strnxfrm
so that varlena/hash/verchar code doesn't worry about the
details.
* Add method structure pg_icu_library, borrowed from Thomas's
patch, that provides one convenient place to provide
multiple-ICU-library support.
* Add a hook that allows you to fill in the pg_icu_library
structure however you want while a pg_locale_t is being
constructed. This allows do-it-yourself ICU library
lockdown.

On the negative side, it increases the line count. Part of that is
because adding indirection for the ICU library is just more lines of
code, but a lot of it is just that I used a lot of smaller functions.
Perhaps my style is a bit verbose?

Even though we're close to consensus on how we should offer control
over the ICU libraries, having the hook may be useful for
experimentation, testing, or as a last resort. Right now the hook has
limited information to use to find the right library -- just the ICU
collation name and the version, because that's what we have in the
catalog. But I assume the patch Thomas is working on will change that.

Performance:

I did brief performance sanity tests on several paths and the results
are unremarkable (which is generally good for a refactor). On the path
I was watching most closely, ICU/UTF8/en-US-x-icu, it came in about 2%
faster, which was a pleasant surprise. This was true both when I
disabled abbreviated keys (to stress localized comparison paths) and
also with abbreviated keys enabled. My previous refactoring work[1]
ended up a percent or two slower. My guess right now is that I moved
some code around after I noticed that ICU accepts NUL-terminated
strings (by specifying the lenght as -1), and that helped. But I'll
need to profile and look more closely to be more certain of my results,
these are preliminary.

There are a few things that could be done differently:

* I am still a bit confused about why someone would want a collation
with a different lc_collate and lc_ctype in libc; and assuming there is
a reason, why it can't be done with ICU. The way I did the refactoring
tries to accommodate them as different concepts, but I can rip that
out.

* In theory, we could also support multilib libc, and an associated
get_libc_library() and hook, but there are a couple big challenges.
Firstly, if it's the default locale, it relies on setlocale(), so we'd
have to figure out what to do about that. Second, having an a second
version of glibc on your system is not as normal or trivial as having a
second version of ICU.

* I made the hook simple, but all it can do is replace the ICU
library. It's possible that it would want to construct it's own entire
pg_locale_t for some reason, and keep more complex state in a private
pointer, or something like that. It seemed better to keep it simple,
but maybe someone would want more flexibility there?

* I used the library indirection for pretty much all ICU calls,
including the ucnv_ and the uloc_ functions. I did this mainly because,
if we are so paranoid about ICU changing in subtle ways, we might as
well make it possible to lock down everything. I can rip this out, too,
but it didn't add many lines.

Loose ends:

* I need to do something with get_collation_actual_version. I had an
earlier iteration that went through pg_newlocale() and then queried the
resulting pg_locale_t structure, but that changed the error paths in a
way that failed a couple tests, so I left that out.

* Error paths could be improved further to make sure that libc
locale_t and UCollator structures are freed in error paths during
construction. I was thinking about using resowner for this, and then if
the pg_locale_t structure gets moved to TopMemoryContext, just doing a
ResourceOwnerForget. Alternatively, I could just be careful about the
error paths.

* We'd need to adapt this and make sure it works with whatever scheme
we decide is best for finding the right library. I suspect this would
just be adding another parameter to get_icu_library (and the hook) to
represent the new collation provider Oid (and get_icu_library could use
that to look up the library names and load them).

Comments welcome.

[1]
https://www.postgresql.org/message-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v2-0001-Add-pg_strcoll-and-pg_strncoll.patch text/x-patch 19.3 KB
v2-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch text/x-patch 24.2 KB
v2-0003-Refactor-pg_locale_t-routines.patch text/x-patch 45.7 KB
v2-0004-Add-method-structure-toward-ICU-multi-library-sup.patch text/x-patch 28.8 KB
v2-0005-Add-get_icu_library_hook.patch text/x-patch 6.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-12-08 06:59:54 Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())
Previous Message David G. Johnston 2022-12-08 05:48:16 Re: ANY_VALUE aggregate