[19] Proposal: function markers to indicate collation/ctype sensitivity

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [19] Proposal: function markers to indicate collation/ctype sensitivity
Date: 2025-06-04 03:22:32
Message-ID: 8e9e8d680aabba2a537e1fe8bb04948b51283af9.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

=== Motivation ===

Right now, we are missing information about the dependencies between
database objects and collation providers, which makes it very hard to
migrate to a new collation provider version smoothly. Even assuming a
user is willing to REINDEX potentially-affected indexes, they have no
way to know which indexes may be affected, so they have to reindex
everything.

This proposal would add that dependency information, and importantly,
would be careful about which dependency entries are required for
particular expressions and which are not.

=== Proposal ===

(The terminology can be confusing here, so I use "CollOid" to mean the
collation object in pg_collation, and COLLATE as the marker keyword.)

Introduce three new options when creating or altering a function,
operator or index AM: COLLATE, CTYPE, or EQUALITY, representing the
operations that the object is sensitive to.

COLLATE marker: indicates that the function or index AM depends on the
CollOid of the input expression for ordering. Example: text_lt(), btree
AM.

CTYPE marker: indicates that the function or index AM depends on the
CollOid of the input expression for case conversion or character
classification. Examples: LOWER(), ILIKE.

EQUALITY marker: indicates that the function or index AM depends on
CollOid for the equality semantics of the input expression. Examples:
texteq(), btree AM, hash AM. (Note: EQUALITY is only important for non-
deterministic collations and we might want to omit it if too many
functions qualify, but it helps illustrate the rest of the proposal so
I will leave it in for now.)

When walking an expression to find dependencies (e.g. functions called
from an index expression), when you encounter a function call, then
look at the inferred CollOid of the input, and:

* if the COLLATE marker is present, record a dependency on that
inferred CollOid unless collate_is_c
* if the CTYPE marker is present, record a dependency on that
inferred CollOid unless ctype_is_c
* if the EQUALITY marker is present, record a dependency on that
inferred CollOid unless it's deterministic

NB: We must make the dependency entry even if CollOid or the function
itself IsPinnedObject()!

Similarly, when creating indexes, check the AM's markers, and add
dependencies for the column's CollOid as appropriate.

=== Benefits ===

* Markers will act as documentation.
* Dependency entries can be followed to find objects dependent on a
given collation, which will be useful during provider upgrades.
* Users of the builtin collation provider or libc where COLLATE=C (and
CTYPE is something else) would have precise dependency entries about
the specific expression indexes that may be affected by a provider
upgrade.
* If necessary, we can add runtime checks that will throw an error if a
function is missing the appropriate marker. There's some work to do
here, but I believe it's possible by checking at the time the collation
is opened.

Because it will add dependencies against the default collation, it
could bloat pg_depend substantially if !collate_is_c. However, in the
future when we have catalog entries for the providers themselves, that
would be useful information rather than just bloat: it would point to a
specific provider entry with a version (lots of details to work out in
a separate proposal).

=== Why separate markers? ===

Consider:

CREATE TABLE foo (t TEXT COLLATE PG_C_UTF8);
-- no additional dependencies created for SUBSTR, because it
-- has no markers
ALTER TABLE foo ADD CHECK (SUBSTR(t,1,3) = t);
-- dependency created for PG_C_UTF8
ALTER TABLE foo ADD CHECK (LOWER(t) = t);
-- no additional dependency created because
-- collate_is_c(PG_C_UTF8) is true
ALTER TABLE foo ADD CHECK (t < 'z');
-- no dependency because collate_is_c
CREATE INDEX foo_idx_1 ON foo USING btree (t);
-- creates a dependency because btree is marked with COLLATE
-- and !collate_is_c
CREATE INDEX foo_idx_2 ON foo USING btree (t COLLATE "en-x-icu");
-- does not create a dependency even though hash is marked
-- with EQUALITY, because "en-x-icu" is deterministic
CREATE INDEX foo_idx_3 ON foo USING hash (t COLLATE "en-x-icu");

If we didn't have separate markers, we'd need to assume that all of
those objects could be affected by a provider update.

=== NORMALIZE() ===

NORMALIZE() is a bit of a strange case because it doesn't depend on the
input collation at all, but we still want to record a dependency. We
have a number of potential solutions:

* Simply exclude it from IsPinnedObject(), which would create a
dependency against the normalize function itself rather than a
collation. That would complicate queries a bit and would not work for
UDFs that depend on our normalization tables.
* Record a dependency against a dummy collation that uses the builtin
provider.
* In the future, when we add catalog entries for the providers
themselves, we can record a dependency directly against the provider
without needing the dummy collation.

If we expect users to create their own functions which depend on our
normalization tables, we can add a fourth marker UNICODE. Otherwise, we
can just special case the few builtin functions we have to create those
dependency entries.

Regards,
Jeff Davis

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shubham Khanna 2025-06-04 03:56:12 Enhance pg_createsubscriber to create required standby.
Previous Message jian he 2025-06-04 03:18:41 Re: Foreign key validation failure in 18beta1