Re: Built-in CTYPE provider

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-02-07 09:53:36
Message-ID: 612987b5-38fa-44c6-8b78-c0be5a177a70@eisentraut.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Review of the v16 patch set:

(Btw., I suppose you started this patch series with 0002 because some
0001 was committed earlier. But I have found this rather confusing. I
think it's ok to renumber from 0001 for each new version.)

* v16-0002-Add-Unicode-property-tables.patch

Various comments are updated to include the term "character class". I
don't recognize that as an official Unicode term. There are categories
and properties. Let's check this.

Some files need heavy pgindent and perltidy. You were surely going to
do this eventually, but maybe you want to do this sooner to check
whether you like the results.

- src/common/unicode/Makefile

This patch series adds some new post-update-unicode tests. Should we
have a separate target for each or just one common "unicode test"
target? Not sure.

- .../generate-unicode_category_table.pl

The trailing commas handling ($firsttime etc.) is not necessary with
C99. The code can be simplified.

For this kind of code:

+print $OT <<"HEADER";

let's use a common marker like EOS instead of a different one for each
block. That just introduces unnecessary variations.

- src/common/unicode_category.c

The mask stuff at the top could use more explanation. It's impossible
to figure out exactly what, say, PG_U_PC_MASK does.

Line breaks in the different pg_u_prop_* functions are gratuitously
different.

Is it potentially confusing that only some pg_u_prop_* have a posix
variant? Would it be better for a consistent interface to have a
"posix" argument for each and just ignore it if not used? Not sure.

Let's use size_t instead of Size for new code.

* v16-0003-Add-unicode-case-mapping-tables-and-functions.patch

Several of the above points apply here analogously.

* v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch

This is mostly a straightforward renaming patch, but there are some
changes in initdb and pg_dump that pre-assume the changes in the next
patch, like which locale columns apply for which providers. I think it
would be better for the historical record to make this a straight
renaming patch and move those semantic changes to the next patch (or a
separate intermediate patch, if you prefer).

- src/bin/psql/describe.c
- src/test/regress/expected/psql.out

This would be a good opportunity to improve the output columns for
collations. The updated view is now:

+ Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules |
Deterministic?
+--------+------+----------+---------+-------+--------+-----------+----------------

This is historically grown but suboptimal. Why is Locale after Collate
and Ctype, and why does it show both? I think we could have just the
Locale column, and if the libc provider is used with different
collate/ctype (very rare!), we somehow write that into the single locale
column.

(A change like this would be a separate patch.)

* v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch

About this initdb --builtin-locale option and analogous options
elsewhere: Maybe we should flip this around and provide a --libc-locale
option, and have all the other providers just use the --locale option.
This would be more consistent with the fact that it's libc that is
special in this context.

Do we even need the "C" locale? We have established that "C.UTF-8" is
useful, but if that is easily available, who would need "C"?

Some changes in this patch appear to be just straight renamings, like in
src/backend/utils/init/postinit.c and
src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put into
the previous patch instead.

On the collation naming: My expectation would have been that the
"C.UTF-8" locale would be exposed as the UCS_BASIC collation. And the
"C" locale as some other name (or not at all, see above). You have this
the other way around.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Wienhold 2024-02-07 09:54:21 Re: Psql meta-command conninfo+
Previous Message Amit Kapila 2024-02-07 09:51:18 Re: Why is subscription/t/031_column_list.pl failing so much?