Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-11-08 16:34:44
Message-ID: CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 7, 2018 at 1:37 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > Maybe you could give my patch a look.
>
> I have, a bit.

While thinking about this problem a bit more, I realized that what is
called RelationBuildPartitionDesc in master and BuildPartitionDesc in
Alvaro's patch has a synchronization problem as soon as we start to
reduce lock levels. At some point, find_inheritance_children() gets
called to get a list of the OIDs of the partitions. Then, later,
SysCacheGetAttr(RELOID, ...) gets called for each one to get its
relpartbound value. But since catalog lookups use the most current
snapshot, they might not see a compatible view of the catalogs.
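
To make that concrete, here is a condensed sketch of the lookup
pattern (not the actual function body, which has more bookkeeping).
Nothing ties the syscache lookups in the loop to the snapshot that
the pg_inherits scan used, so consecutive iterations can observe
pg_class at different points in time:

    List       *inhoids;
    ListCell   *lc;

    /* Scans pg_inherits using a snapshot taken at this moment. */
    inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);

    foreach(lc, inhoids)
    {
        Oid         inhrelid = lfirst_oid(lc);
        HeapTuple   tuple;
        Datum       datum;
        bool        isnull;

        tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(inhrelid));
        if (!HeapTupleIsValid(tuple))
            elog(ERROR, "cache lookup failed for relation %u", inhrelid);

        /* May reflect a newer catalog state than the scan above. */
        datum = SysCacheGetAttr(RELOID, tuple,
                                Anum_pg_class_relpartbound, &isnull);
        /* ... decode and store the partition bound ... */
        ReleaseSysCache(tuple);
    }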

That could manifest in a few different ways:

- We might see a newer version of relpartbound, where it's now null
because it's been detached.
- We might see a newer version of relpartbound where it now has an
unrelated value because it has been detached and then reattached to
some other partitioned table.
- We might see newer versions of relpartbound for some tables than
others. For instance, suppose we had partition A for 1..200 and B for
201..300. Then we realize that this is not what we actually wanted to
do, so we detach A and reattach it with a bound of 1..100 and detach
B and reattach it with a bound of 101..300. If we perform the
syscache lookup for A before this happens and the syscache lookup for
B after this happens, we might see the old bound for A and the new
bound for B, and that would be sad, because they overlap.
- Seeing an older relpartbound for some other table is also a problem
for other reasons -- we will have the wrong idea about the bounds of
that partition and may put the wrong tuples into it. Without
AccessExclusiveLock, I don't think there is anything that keeps us
from reading stale syscache entries.

Alvaro's patch defends against the first of these cases by throwing an
error, which, as I already said, I don't think is acceptable, but I
don't see any defense at all against the other cases. The root of the
problem is the way catalog lookups work today: each individual
lookup uses the latest available snapshot, but there is zero guarantee
that consecutive lookups use the same snapshot. Therefore, as soon as
you start lowering lock levels, you are at risk for inconsistent data.

I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups. That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all. Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.
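
Just to sketch what I have in mind -- this is hypothetical and
untested, not code from any patch, with error handling and bound
decoding elided.  One registered snapshot drives both the pg_inherits
scan and the pg_class fetches (through the OID index, instead of the
syscache):

    Snapshot    snapshot = RegisterSnapshot(GetLatestSnapshot());
    Relation    inhrel = heap_open(InheritsRelationId, AccessShareLock);
    Relation    classrel = heap_open(RelationRelationId, AccessShareLock);
    ScanKeyData key;
    SysScanDesc scan;
    HeapTuple   inhtup;

    ScanKeyInit(&key, Anum_pg_inherits_inhparent,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(RelationGetRelid(rel)));
    scan = systable_beginscan(inhrel, InheritsParentIndexId, true,
                              snapshot, 1, &key);
    while ((inhtup = systable_getnext(scan)) != NULL)
    {
        Oid         childoid = ((Form_pg_inherits) GETSTRUCT(inhtup))->inhrelid;
        ScanKeyData ckey;
        SysScanDesc cscan;
        HeapTuple   classtup;
        Datum       datum;
        bool        isnull;

        ScanKeyInit(&ckey, ObjectIdAttributeNumber,
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(childoid));
        /* Same snapshot again, so all lookups see one catalog state. */
        cscan = systable_beginscan(classrel, ClassOidIndexId, true,
                                   snapshot, 1, &ckey);
        classtup = systable_getnext(cscan);
        if (HeapTupleIsValid(classtup))
        {
            datum = heap_getattr(classtup, Anum_pg_class_relpartbound,
                                 RelationGetDescr(classrel), &isnull);
            /* ... decode the bound from datum if !isnull ... */
        }
        systable_endscan(cscan);
    }
    systable_endscan(scan);
    heap_close(classrel, AccessShareLock);
    heap_close(inhrel, AccessShareLock);
    UnregisterSnapshot(snapshot);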

Note that this problem still exists even if you allow concurrent
attach but not concurrent detach, though it's not as bad in that
case: when you encounter a concurrently attached partition, you know
it hasn't also been concurrently detached from someplace else.
Presumably you
either see the latest value of the partition bound or the NULL value
which preceded it, but not anything else. If that's so, then maybe
you could get by without using a consistent snapshot for all of your
information gathering: if you see NULL, you know that the partition
was concurrently added and you just ignore it. There's still no
guarantee that all parallel workers would come to the same conclusion,
though, which doesn't feel too good.
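
The "see NULL, ignore it" rule might amount to nothing more than this
inside the loop sketched earlier (again hypothetical):

    datum = SysCacheGetAttr(RELOID, tuple,
                            Anum_pg_class_relpartbound, &isnull);
    if (isnull)
    {
        /* Assume a concurrent attach that isn't fully visible; skip it. */
        ReleaseSysCache(tuple);
        continue;
    }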

Personally, I don't think it's right to blame that problem on parallel
query. The problem is more general than that: we assume that holding
any kind of lock on a relation is enough to keep the important
details of the relation static, and that it's therefore fine to do
staggered lookups within one backend, and also fine to do staggered
lookups across different backends. When you remove the basic
assumption that any lock is enough to prevent concurrent DDL, the
whole idea that you can do different lookups at different times with
different snapshots (possibly in different backends) and get sane
answers also ceases to be correct. But the idea that you can
look up different bits of catalog data at whatever time is convenient
undergirds large amounts of our current machinery -- it's built into
relcache, syscache, sinval, ...

I think that things get even crazier if we postpone locking on
individual partitions until we need to do something with that
partition, as has been proposed elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
