Allow foreign keys to reference a superset of unique columns

From: Kaiting Chen <ktchen14(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: hellopfm(at)gmail(dot)com
Subject: Allow foreign keys to reference a superset of unique columns
Date: 2022-06-07 18:59:24
Message-ID: CA+CLzG8HZUk8Gb9BKN88fgdSEqHx=2iq5aDdvbz7JqSFjA2WxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi everyone:

I'd like to propose a change to PostgreSQL to allow the creation of a foreign
key constraint referencing a superset of uniquely constrained columns.

As it currently stands:

CREATE TABLE foo (a integer PRIMARY KEY, b integer);
CREATE TABLE bar (x integer, y integer,
FOREIGN KEY (x, y) REFERENCES foo(a, b));
> ERROR: there is no unique constraint matching given keys for referenced
table "foo"

Despite the fact that in "foo", the combination of columns (a, b) is guaranteed
to be unique by virtue of being a superset of the primary key (a).

This capability has been requested before, at least once in 2004 [1] and again
in 2021 [2].

To illustrate when it'd be useful to define such a foreign key constraint,
consider a database that will store graphs (consisting of nodes and edges) where
graphs are discrete and intergraph edges are prohibited:

CREATE TABLE graph (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
);

CREATE TABLE node (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
graph_id INTEGER NOT NULL REFERENCES graph(id)
);

CREATE TABLE edge (
graph_id INTEGER,
source_id INTEGER,
FOREIGN KEY (source_id, graph_id) REFERENCES node(id, graph_id),

target_id INTEGER,
FOREIGN KEY (target_id, graph_id) REFERENCES node(id, graph_id),

PRIMARY KEY (graph_id, source_id, target_id)
);

This schema is unsupported by PostgreSQL absent this constraint:

ALTER TABLE node ADD UNIQUE (id, graph_id);

However, this constraint, as well as its underlying unique index, is superfluous
as node(id) itself is unique. Its addition serves no semantic purpose but incurs
cost of additional on disk storage and update time. Note the prohibition of
intergraph edges isn't enforceable on "edge" without the composite foreign keys
(or triggers).

An alternative approach is to redefine node's PRIMARY KEY as (id, graph_id).
However, this would force every table referring to "node" to duplicate both
columns into their schema, even when a singular "node_id" would suffice. This is
undesirable if there are many tables referring to "node" that have no such
intergraph restrictions and few that do.

While it can be argued that this schema contains some degree of denormalization,
it isn't uncommon and a recent patch was merged to support exactly this kind of
design [3].

In that case, the SET NULL and SET DEFAULT referential actions gained support
for an explicit column list to accomodate this type of design.

A problem evinced by Tom Lane in the 2004 discussion on this was that, were this
permitted, the index supporting a foreign key constraint could be ambiguous:

> I think one reason for this is that otherwise it's not clear which
> unique constraint the FK constraint depends on. Consider
>
> create table a (f1 int unique, f2 int unique);
>
> create table b (f1 int, f2 int,
> foreign key (f1,f2) references a(f1,f2));
>
> How would you decide which constraint to make the FK depend on?
> It'd be purely arbitrary.

I propose that this problem is addressed, with an extension to the SQL standard,
as follows in the definition of a foreign key constraint:

1. Change the restriction on the referenced columns in a foreign key constraint
to:

The referenced columns must be a *superset* (the same, or a strict superset)
of the columns of a non-deferrable unique or primary key index on the
referenced table.

2. The FOREIGN KEY constraint syntax gains a [ USING INDEX index_name ] clause
optionally following the referenced column list.

The index specified by this clause is used to support the foreign key
constraint, and it must be a non-deferrable unique or primary key index on
the referenced table compatible with the referenced columns.

Here, compatible means that the columns of the index are a subset (the same,
or a strict subset) of the referenced columns.

3. If the referenced columns are the same as the columns of such a unique (or
primary key) index on the referenced table, then the behavior in PostgreSQL
doesn't change.

4. If the referenced columns are a strict superset of the columns of such an
index on the referenced table, then:
1. If the primary key of the referenced table is a strict subset of the
referenced columns, then its index is used to support the foreign key if
no USING INDEX clause is present.
2. Otherwise, the USING INDEX clause is required.

I believe that this scheme is unambiguous and should stably round trip a dump
and restore. In my previous example, the foreign key contraints could then be
defined as:

FOREIGN KEY (source_id, graph_id) REFERENCES node(id, graph_id),
FOREIGN KEY (target_id, graph_id) REFERENCES node(id, graph_id),

Or alternatively:

FOREIGN KEY (source_id, graph_id) REFERENCES node(id, graph_id)
USING INDEX node_pkey,
FOREIGN KEY (target_id, graph_id) REFERENCES node(id, graph_id)
USING INDEX node_pkey,

Also, the addition of a USING INDEX clause may be useful in its own right. It's
already possible to create multiple unique indexes on a table, with the same set
of columns, but differing storage parameters of index tablespaces. In this
situation, when multiple indexes can support a foreign key constraint, which
index is chosen appears to be determined in OID order. This clause would allow a
user to unambiguously specify which index to use.

I've attached a first draft patch that implements what I've described and would
love some feedback. For all referenced columns that aren't covered by the chosen
index, it chooses the opclass to use by finding the default opclass for the
column's type for the same access method as the chosen index. (Would it be
useful to allow the specification of opclasses in the referenced column list of
a foreign key constraint, similar to the column list in CREATE INDEX?)

Also, in pg_get_constraintdef_worker(), it adds a USING INDEX clause only when a
foreign key constraint is supported by an index that:

1. Isn't a primary key, and
2. Has a different number of key columns than the number of constrained columns

I *think* that this should ensure that a USING INDEX clause is added only when
required.

[1] https://www.postgresql.org/message-id/flat/CAF%2B2_SGhbc6yGUoNFjDOgjq1VpKpE5WZfOo0M%2BUwcPH%3DmddNMg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/1092734724.2627.4.camel%40dicaprio.akademie1.de
[3] https://www.postgresql.org/message-id/flat/CACqFVBZQyMYJV%3DnjbSMxf%2BrbDHpx%3DW%3DB7AEaMKn8dWn9OZJY7w%40mail.gmail.com

Attachment Content-Type Size
fk-unique-superset.diff application/octet-stream 17.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-06-07 19:14:43 Re: Invalid memory access in pg_stat_get_subscription
Previous Message Peter Geoghegan 2022-06-07 18:53:28 Re: An inverted index using roaring bitmaps