Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-11-06 22:09:59
Message-ID: CA+Tgmoaeix0L+7Vtt0EgwQE2dVo4MUE8MnpHX0fJ1=2Pbg1LQg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 7, 2018 at 9:29 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> One approach would be to make sure that everything relying on
> rt_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rt_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.
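
Sketching that out for concreteness (the field and helper names below,
like rd_olddescs and free_partition_desc(), are made up for
illustration and don't exist today), I imagine something like:

    /* hypothetical: called when the relcache rebuilds the partdesc */
    static void
    RelationReplacePartitionDesc(Relation rel, PartitionDesc newdesc)
    {
        if (rel->rd_partdesc != NULL)
        {
            if (rel->rd_refcnt > 0)
            {
                /*
                 * Somebody may still be using the old descriptor via a
                 * local variable, so stash it rather than freeing it.
                 */
                rel->rd_olddescs = lappend(rel->rd_olddescs,
                                           rel->rd_partdesc);
            }
            else
                free_partition_desc(rel->rd_partdesc);
        }
        rel->rd_partdesc = newdesc;
    }

    /*
     * ...and then RelationClose(), at the point where rd_refcnt drops
     * to zero, would walk rd_olddescs and free whatever accumulated.
     */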

Some analysis of possible trouble spots:

- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc. Presumably, this means that if
the partition descriptor gets updated on the fly, the tuple routing
and partition dispatch code could end up with different ideas about
which partitions exist. I think this should be fixed somehow, so that
we only call RelationGetPartitionDesc once per query and use the
result for everything (see the sketch after this list).

- expand_inherited_rtentry checks
RelationGetPartitionDesc(oldrelation) != NULL. If so, it calls
expand_partitioned_rtentry which fetches the same PartitionDesc again.
We can probably just do this once in the caller and pass the result
down.

- set_relation_partition_info also calls RelationGetPartitionDesc.
Off-hand, I think this code runs after expand_inherited_rtentry. Not
sure what to do about this. I'm not sure what the consequences would
be if this function and that one had different ideas about the
partition descriptor.

- tablecmds.c is pretty free about calling RelationGetPartitionDesc
repeatedly, but it probably doesn't matter. If we're doing some kind
of DDL that depends on the contents of the partition descriptor, we
*had better* be holding a lock strong enough to prevent the partition
descriptor from being changed by somebody else at the same time.
Allowing a partition to be added concurrently with DML is one thing;
allowing a partition to be added concurrently with adding another
partition is a whole different level of insanity. I think we'd be
best advised not to go down that rathole - among other concerns, how
would you even guarantee that the partitions being added didn't
overlap?
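
For the first two items, in sketch form (the extra partdesc arguments
are hypothetical, not the current signatures), I'm imagining the usual
"fetch once, pass it down" shape:

    /* executor startup: look the descriptor up exactly once... */
    PartitionDesc partdesc = RelationGetPartitionDesc(rel);

    /* ...and hand the same pointer to every consumer */
    proute = ExecSetupPartitionTupleRouting(mtstate, rel, partdesc);
    prunestate = ExecCreatePartitionPruneState(planstate, pruneinfo,
                                               partdesc);

and similarly on the planner side:

    PartitionDesc partdesc = RelationGetPartitionDesc(oldrelation);

    if (partdesc != NULL)
        expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
                                   lockmode, &root->append_rel_list,
                                   partdesc);   /* instead of refetching */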

Generally:

Is it really OK to postpone freeing the old partition descriptor until
the relation reference count goes to 0? I wonder if there are cases
where this could lead to tons of copies of the partition descriptor
floating around at the same time, gobbling up precious cache memory.
My first thought was that this would be pretty easy: just create a lot
of new partitions one by one while some long-running transaction is
open. But the actual result in that case depends on the behavior of
the backend running the transaction. If it just ignores the new
partitions and sticks with the partition descriptor it has got, then
probably nothing else will request the new partition descriptor either
and there will be no accumulation of memory. However, if it tries to
absorb the updated partition descriptor, but without being certain
that the old one can be freed, then we'd have a query-lifespan memory
leak which is quadratic in the number of new partitions.

Maybe even that would be OK -- we could suppose that the number of new
partitions probably wouldn't be all THAT crazy large, and the constant
factor not too bad, so maybe you'd leak a couple of MB for the length
of the query, but no more. However, I wonder if it would be better to
give each PartitionDescData its own refcnt, so that it can be freed
immediately when the refcnt goes to zero. That would oblige every
caller of RelationGetPartitionDesc() to later call something like
ReleasePartitionDesc(). We could catch failures to do that by keeping
all the PartitionDesc objects so far created in a list. When the main
entry's refcnt goes to 0, cross-check that this list is empty; if not,
then the remaining entries have non-zero refcnts that were leaked. We
could emit a WARNING as we do in similar cases.
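
Concretely, the API I'm imagining would look about like this -- pure
sketch, none of the refcnt machinery or the rd_all_partdescs list below
exists today:

    typedef struct PartitionDescData
    {
        int         refcnt;         /* new: pins handed out to callers */
        int         nparts;         /* existing fields follow */
        Oid        *oids;
        PartitionBoundInfo boundinfo;
    } PartitionDescData;

    PartitionDesc
    RelationGetPartitionDesc(Relation rel)
    {
        PartitionDesc partdesc = rel->rd_partdesc;

        /* the relcache entry itself would also hold one reference */
        partdesc->refcnt++;
        return partdesc;            /* caller must ReleasePartitionDesc */
    }

    void
    ReleasePartitionDesc(PartitionDesc partdesc)
    {
        Assert(partdesc->refcnt > 0);
        if (--partdesc->refcnt == 0)
            free_partition_desc(partdesc);      /* hypothetical */
    }

    /*
     * At relcache-entry teardown, cross-check the descriptors created
     * for this relation (tracked in a hypothetical rd_all_partdescs
     * list); anything still pinned was leaked by a caller.
     */
    static void
    CheckForPartitionDescLeaks(Relation rel)
    {
        ListCell   *lc;

        foreach(lc, rel->rd_all_partdescs)
        {
            PartitionDesc pd = (PartitionDesc) lfirst(lc);

            if (pd->refcnt > 0)
                elog(WARNING, "PartitionDesc reference leak: %p", pd);
        }
    }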

In general, I think something along the lines you are suggesting here
is the right place to start attacking this problem. Effectively, we
immunize the system against the possibility of new entries showing up
in the partition descriptor while concurrent DML is running; the
semantics are that the new partitions are ignored for the duration of
currently-running queries. This seems to allow for painless creation
or addition of new partitions in normal cases, but not when a default
partition exists. In that case, using the old PartitionDesc is
outright wrong, because adding a new toplevel partition changes the
default partition's partition constraint. We can't insert into the
default partition a tuple that under the updated table definition
needs to go someplace else. It seems like the best way to account for
that is to reduce the lock level on the partitioned table to
ShareUpdateExclusiveLock, but leave the lock level on any default
partition as AccessExclusiveLock (because we are modifying a
constraint on it). We would also need to leave the lock level on the
new partition as AccessExclusiveLock (because we are adding a
constraint on it). Not perfect, for sure, but not bad for a first
patch, either; it would improve things for users in a bunch of
practical cases.
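
In terms of what ATTACH PARTITION would actually take, roughly (a
sketch that paraphrases rather than quotes the real tablecmds.c code):

    /* parent: allow concurrent DML, but not concurrent DDL */
    rel = heap_open(parent_relid, ShareUpdateExclusiveLock);

    /* the table being attached acquires a new partition constraint */
    attachrel = heap_open(attach_relid, AccessExclusiveLock);

    /* the default partition's implicit constraint changes as well */
    defaultPartOid = get_default_partition_oid(RelationGetRelid(rel));
    if (OidIsValid(defaultPartOid))
        defaultrel = heap_open(defaultPartOid, AccessExclusiveLock);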

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
