Re: [PATCH] Partial foreign key updates in referential integrity triggers

From: Paul Martinez <hellopfm(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Zhihong Yu <zyu(at)yugabyte(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Partial foreign key updates in referential integrity triggers
Date: 2021-12-02 00:17:20
Message-ID: CAF+2_SH8Lg7qz-Rz9HNU9aHvjiu8Db4ftYdJEhpdwC1LvgDsrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 24, 2021 at 10:59 AM Peter Eisentraut
<peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:
>
> I was looking through this example to see if it could be adapted for the
> documentation.
>
> The way the users table is defined, it appears that "id" is actually
> unique and the primary key ought to be just (id). The DELETE command
> you show also just uses the id column to find the user, which would be
> bad if the user id is not unique across tenants. If the id were unique,
> then the foreign key from posts to users would just use the user id
> column and the whole problem of the ON DELETE SET NULL action would go
> away. If the primary key of users is indeed supposed to be (tenant_id,
> id), then maybe the definition of the users.id column should not use
> serial, and the DELETE command should also look at the tenant_id column.
> (The same question applies to posts.id.)
>
> Also, you initially wrote that this is a denormalized schema. I think
> if we keep the keys the way you show, then this isn't denormalized. But
> if we considered users.id globally unique, then there would be
> normalization concerns.
>
> What do you think?

Regarding that specific example, in a production scenario, yes, the
DELETE command should reference both columns. And if used for
documentation both columns should be referenced for clarity/correctness.

I don't think the exact semantics regarding the uniqueness of the id
column are critical. Configuring separate auto-incrementing ids per
tenant would be fairly complex; practically speaking, a single
database with multi-tenant data will use serial to get auto-incrementing
ids (or else use UUIDs to prevent conflicts). The possibility of
conflicting ids likely won't arise until moving to a distributed
environment, at which point queries should only be routed towards a
single shard (where uniqueness will still hold), either by some higher
level application-level context, or by including the tenant_id as part
of the query.

I think there are three separate motivating use cases for using
(tenant_id, id) as primary keys everywhere in a multi-tenant database:

1) I initially encountered this problem while migrating a database to use
Citus, which requires that primary keys (and any other uniqueness
constraints) include the shard key, which forces the primary key to be
(tenant_id, id). I'm not sure what constraints other sharding solutions
enforce, but I don't feel like this feature is over-fitting to Citus'
specific implementation -- it seems like a pretty
reasonable/generalizable solution when sharding data: prefix all your
indexes with the shard key.

2) As I mentioned in my response to Tom in my original proposal thread,
and as Matthias alluded to, using composite primary keys grants
significantly stronger referential integrity by preventing cross-tenant
references. I think this represents a significant leap in the robustness
and security of a schema, to the point where you could consider it a
design flaw to _not_ use composite keys.

https://www.postgresql.org/message-id/flat/CAF%2B2_SFFCjWMpxo0cj3yaqMavcb3Byd0bSG%2B0UPs7RVb8EF99g%40mail.gmail.com#c0e2b37b223bfbf8ece561f02865286c

3) For performance reasons, indexes on foreign keys will often be
prefixed by the tenant_id to speed up index scans. (I think
algorithmically doing an index lookup on (fk_id) vs. (tenant_id, fk_id)
has the same complexity, but repeated index scans, such as when doing a
join, should in practice be more efficient when including a tenant_id,
because most queries will only reference a single tenant so the looked
up values are more likely to be on the same pages.) If a foreign key
only references the id column, then ON DELETE CASCADE triggers will only
use the id column in their DELETE query. Thus, to ensure that deletes
are still fast, you will need to create an index on (fk_id) in addition
to the (tenant_id, fk_id) index, which would cause _significant_
database bloat. (In practice, the presence of both indexes will also
confuse the query planner and now BOTH indexes will take up precious
space in the database's working memory, so it really creates all sorts
of problems.) Using a composite foreign key will ensure that ON DELETE
CASCADE trigger query will use both columns.

- Paul

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2021-12-02 00:19:19 Re: O(n) tasks cause lengthy startups and checkpoints
Previous Message Justin Pryzby 2021-12-01 23:59:15 Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)