Re: Move defaults toward ICU in 16?

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Meskes <meskes(at)postgresql(dot)org>
Subject: Re: Move defaults toward ICU in 16?
Date: 2023-03-04 05:45:06
Message-ID: 438b15cf88c9c5cf0cf1c24bf7b835d6362a8e0c.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2023-03-02 at 10:37 +0100, Peter Eisentraut wrote:
> I would skip this.  There was a brief discussion about this at [0],
> where I pointed out that if we are going to do something like that,
> there would be other candidates among the optional dependencies to
> promote, such as certainly openssl and arguably lz4.  If we don't do
> this consistently across dependencies, then there will be confusion.

The difference is that ICU affects semantics of collations, and
collations are not really an optional feature. If postgres is built
without ICU, that will affect the default at initdb time (after patch
003, anyway), which will then affect the default collations in all
databases.

> In practice, I don't think it matters.  Almost all installations are
> made by packagers, who will make their own choices.

Mostly true, but the discussion at [0] reveals that some people do
build postgresql themselves for whatever reason.

When I first started out with postgres I always built from source. That
was quite a while ago, so maybe that means nothing; but it would be sad
to think that the build-it-yourself experience doesn't matter.

> >    0002: update template0 in new cluster (as described above)
>
> This makes sense.  I'm confused what the code originally wanted to
> achieve, e.g.,
>
> -/*
> - * Check that every database that already exists in the new cluster
> is
> - * compatible with the corresponding database in the old one.
> - */
> -static void
> -check_databases_are_compatible(void)
>
> Was there once support for the new cluster having additional
> databases
> in place?  Weird.

It looks like 33755e8edf was the last significant change here. CC
Heikki for comment.

> In any case, I think you can remove additional code from
> get_db_infos()
> related to fields that are no longer used, such as db_encoding, and
> the
> corresponding struct fields in DbInfo.

You're right: there's not much of an intersection between the code that
needs a DbInfo and the code that needs the locale fields. I created a
separate DbLocaleInfo struct for the template0 locale information, and
removed the locale fields from DbInfo.

I also added a TAP test.

> >    0003: default initdb to use ICU
>
> What are the changes in the citext tests about?

There's a test in citext_utf8.sql for:

SELECT 'i'::citext = 'İ'::citext AS t;

citext_eq uses DEFAULT_COLLATION_OID, comparing the results after
applying lower(). Apparently:

lower('İ' collate "en_US") = 'i' -- true
lower('İ' collate "tr-TR-x-icu") = 'i' -- true
lower('İ' collate "en-US-x-icu") = 'i' -- false

the test was passing before because it seems to be true for all libc
locales. But for ICU, it seems to only be true in the "tr-TR" locale at
colstrength=secondary (and true for other ICU locales at
colstrength=primary).

We can't fix the test by being explicit about the collation, because
citext doesn't pass it down; it always uses the default collation. We
could fix citext to pass it down properly, but that seems like a
different patch.

In any case, citext doesn't seem very important to those using ICU (we
have a doc note suggesting ICU instead), so I don't see a strong reason
to test the combination. So, I just exit the test early if it's ICU. I
added a better comment.

>   Is it the same issue as
> in unaccent?  In that case, the OR should be an AND?  Maybe add a
> comment?
>
> Why is unaccent is "broken" if the default collation is provided by
> ICU
> and LC_CTYPE=C?  Is that a general problem?  Should we prevent this
> combination?

A different issue: unaccent is calling t_isspace(), which is just not
properly handling locales. t_isspace() always passes NULL as the last
argument to char2wchar. There are TODO comments throughout that file.

Specifically what happens:
lc_ctype_is_c(DEFAULT_COLLATION_OID) returns false
so it calls char2wchar(), which calls mbstowcs()
which returns an error because the LC_CTYPE=C

Right now, that's a longstanding issue for all users of t_isspace() and
related functions: tsearch, ltree, pg_trgm, dict_xsyn, and unaccent. I
assume it was known and considered unimportant, otherwise we wouldn't
have left the TODO comments in there.

I believe it's only a problem when the provider is ICU and the LC_CTYPE
is C. I think a quick fix would be to just test LC_CTYPE directly (from
the environment or setlocale(LC_CTYPE, NULL)) rather than try to
extract it from the default collation. It sounds like a separate patch,
and should be handled as a bugfix and backported.

A better fix would be to support character classification in ICU. I
don't think that's hard, but ICU has quite a few options, and we'd need
to discuss which are the right ones to support. We may also want to
pass collation information down rather than just using the database
default, but that may depend on the caller and we should discuss that,
as well.

> What are the changes in the ecpg tests about?  Looks harmless, but if
> there is a need, maybe it should be commented somewhere, otherwise
> what
> prevents someone from changing it back?

ICU is not compatible with SQL_ASCII, so I had to remove the
ENCODING=SQL_ASCII line from the ecpg test build. CC Michael Meskes who
added the line in 1fa6be6f69 in case he has a comment.

But when I did that, I got CI failures on windows because it couldn't
transcode between LATIN1 and WIN1252. So I changed the ecpg test to
just use SQL_ASCII for the client_encoding (not the server encoding).
Michael Meskes added the client_encoding parameter test in 5e7710e725,
so he might have a comment about that as well.

Since I removed the code, I didn't see a clear place to add a comment,
but if you have a suggestion I'll take it.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v5-0003-Use-ICU-by-default-at-initdb-time.patch text/x-patch 18.6 KB
v5-0002-pg_upgrade-copy-locale-and-encoding-information-t.patch text/x-patch 18.9 KB
v5-0001-Build-ICU-support-by-default.patch text/x-patch 8.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-03-04 06:13:39 Re: pg_upgrade and logical replication
Previous Message Bharath Rupireddy 2023-03-04 04:17:05 Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()