Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Sergei Kornilov <sk(at)zsrv(dot)org>, Amit Langote <langote_amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2019-01-28 16:14:58
Message-ID: CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=LJmug@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Right, the planner/executor "disconnect" is one of the challenges, and
> why I was trying to keep the old copy of the PartitionDesc around
> instead of building updated ones as needed.

I agree that would be simpler, but I just don't see how to make it
work. For one thing, keeping the old copy around can't work in
parallel workers, which never had a copy in the first place. For two
things, we don't have a really good mechanism to keep the
PartitionDesc that was used at plan time around until execution time.
Keeping the relation open would do it, but I'm pretty sure that causes
other problems; the system doesn't expect any residual references.

I know you had a solution to this problem, but I don't see how it can
work. You said "Snapshots have their own cache (hash table) of
partition descriptors. If a partdesc is requested and the snapshot has
already obtained that partdesc, the original one is returned -- we
don't request a new one from partcache." But presumably this means
when the last snapshot is unregistered, the cache is flushed
(otherwise, when?) and if that's so then this relies on the snapshot
that was used for planning still being around at execution time, which
I am pretty sure isn't guaranteed.
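
To make that lifetime issue concrete, here is a tiny standalone model
(invented names only -- this is not the patch and not PostgreSQL code) of
a partdesc cache tied to a snapshot: once the snapshot is unregistered
the cached copy is gone, so the executor can only reuse the planner's
copy if the planning snapshot happens to still be registered, which
nothing guarantees.

#include <stdio.h>
#include <stdlib.h>

struct partdesc { int nparts; };      /* stand-in for a real PartitionDesc */

struct snapshot
{
    struct partdesc *cached;          /* stand-in for the per-snapshot hash */
};

static struct partdesc *
get_partdesc(struct snapshot *snap)
{
    /* Return the snapshot's cached copy, building one on first request. */
    if (snap->cached == NULL)
        snap->cached = calloc(1, sizeof(struct partdesc));
    return snap->cached;
}

static void
unregister_snapshot(struct snapshot *snap)
{
    /* The cache lives and dies with the snapshot. */
    free(snap->cached);
    snap->cached = NULL;
}

int
main(void)
{
    struct snapshot plan_snapshot = {NULL};
    struct snapshot exec_snapshot = {NULL};

    get_partdesc(&plan_snapshot);           /* planner's copy */
    unregister_snapshot(&plan_snapshot);    /* nothing keeps it registered */

    /* The executor's snapshot starts out empty and has to build a new
     * partdesc, which may no longer match what the plan assumed. */
    printf("rebuilt at execution: %s\n",
           get_partdesc(&exec_snapshot) ? "yes" : "no");
    free(exec_snapshot.cached);
    return 0;
}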

Also, and I think this point is worthy of some further discussion, the
thing that really seems broken to me about your design is the idea
that it's OK to use the query or transaction snapshot to decide which
partitions exist. The problem with that is that some query or
transaction with an old snapshot might see as a partition some table
that has since been radically altered - different column types,
attached to some other table now, attached to the same table but with
different bounds - or dropped outright. And therefore it might try to
insert data into that table and fail in all kinds of crazy ways, about
the mildest of which is inserting data that doesn't match the current
partition constraint. I'm willing to be told that I've misunderstood
the way it all works and this isn't really a problem for some reason,
but my current belief is that not only is it a problem with your
design, but that it's such a bad problem that there's really no way to
fix it and we have to abandon your entire approach and go a different
route. If you don't think that's true, then perhaps we should discuss
it further.

> > I suppose the reason for this is so
> > that we don't have to go to the expense of copying the partition
> > bounds from the PartitionDesc into the final plan, but it somehow
> > seems a little scary to me. Perhaps I am too easily frightened, but
> > it's certainly a problem from the point of view of this project, which
> > wants to let the PartitionDesc change concurrently.
>
> Well, my definition of the problem started with the assumption that we
> would keep the partition array indexes unchanged, so "change
> concurrently" is what we needed to avoid. Yes, I realize that you've
> opted to change that definition.

I don't think I made a conscious decision to change this, and I'm kind
of wondering whether I have missed some better approach here. I feel
like the direction I'm pursuing is an inevitable consequence of having
no good way to keep the PartitionDesc around from plan-time to
execution-time, which in turn feels like an inevitable consequence of
the two points I made above: there's no guarantee that the plan-time
snapshot is still registered anywhere by the time we get to execution
time, and even if there were, the associated PartitionDesc may point
to tables that have been drastically modified or don't exist any more.
But it's possible that my chain-of-inevitable-consequences has a weak
link, in which case I would surely like it if you (or someone else)
would point it out to me.

> I may have forgotten some of your earlier emails on this, but one aspect
> (possibly a key one) is that I'm not sure we really need to cope, other
> than with an ERROR, with queries that continue to run across an
> attach/detach -- moreso in absurd scenarios such as the ones you
> described where the detached table is later re-attached, possibly to a
> different partitioned table. I mean, if we can just detect the case and
> raise an error, and this let us make it all work reasonably, that might
> be better.

Well, that's an interesting idea. I assumed that users would hate
that kind of behavior with a fiery passion that could never be
quenched. If not, then the problem changes from *coping* with
concurrent changes to *detecting* concurrent changes, which may be
easier, but see below.

> I think detaching partitions concurrently is a necessary part of this
> feature, so I would prefer not to go with a solution that works for
> attaching partitions but not for detaching them. That said, I don't see
> why it's impossible to adjust the partition maps in both cases. But I
> don't have anything better than hand-waving ATM.

The general problem here goes back to what I wrote in the third
paragraph of this email: a PartitionDesc that was built with a
particular snapshot can't be assumed to be usable after any subsequent
DDL has occurred that might affect the shape of the PartitionDesc.
For example, if somebody detaches a partition and reattaches it with
different partition bounds, and we use the old PartitionDesc for tuple
routing, we'll route tuples into that partition that do not satisfy
its partition constraint. And we won't even get an ERROR, because the
system assumes that any tuple which arrives at a partition as a result
of tuple routing must necessarily satisfy the partition constraint.
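
As a purely illustrative sketch of that failure mode (standalone C with
invented names; this is not PostgreSQL's tuple-routing code), consider
routing against a cached copy of range bounds that is never rechecked
against the partition's current constraint:

#include <stdio.h>

struct cached_partdesc
{
    int     nparts;
    int     lower[8];   /* inclusive lower bound per partition */
    int     upper[8];   /* exclusive upper bound per partition */
};

/* Route a key using the cached bounds; the result is trusted as-is. */
static int
route_key(const struct cached_partdesc *pd, int key)
{
    for (int i = 0; i < pd->nparts; i++)
    {
        if (key >= pd->lower[i] && key < pd->upper[i])
            return i;   /* no later recheck of the partition constraint */
    }
    return -1;          /* no partition found */
}

int
main(void)
{
    /* Bounds as planning saw them: one partition covering [0, 100). */
    struct cached_partdesc stale = {1, {0}, {100}};

    /* Suppose the partition has since been detached and re-attached
     * FOR VALUES FROM (0) TO (50).  Routing with the stale copy still
     * happily puts key 75 into it -- no error, just a bad row. */
    printf("key 75 -> partition %d\n", route_key(&stale, 75));
    return 0;
}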

If only concurrent CREATE/ATTACH operations are allowed and
DROP/DETACH is not, then that kind of thing isn't possible. Any new
partitions which have shown up since the plan was created can just be
ignored, and the old ones must still have the same partition bounds
that they did before, and everything is fine. Or, if we're OK with a
less-nice solution, we could just ERROR out when the number of
partitions has changed. Some people will get errors they don't like,
but they won't end up with rows in their partitions that violate the
constraints.
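
A compact sketch of that cruder check (again standalone, with made-up
names, and not anything in the tree): remember how many partitions the
plan saw and refuse to run if the count has moved.

#include <stdio.h>
#include <stdlib.h>

static void
require_same_nparts(int nparts_at_plan_time, int nparts_now)
{
    if (nparts_now != nparts_at_plan_time)
    {
        fprintf(stderr, "ERROR: partitions changed since the query was planned\n");
        exit(1);
    }
}

int
main(void)
{
    require_same_nparts(3, 3);  /* fine: nothing changed */
    require_same_nparts(3, 4);  /* a partition appeared: error out */
    return 0;
}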

But as soon as you allow concurrent DETACH, then things get really
crazy. Even if, at execution time, there are the same number of
partitions as I had at plan time, and even if those partitions have
the same OIDs as what I had at plan time, and even if those OIDs are
in the same order in the PartitionDesc, it does not prove that things
are OK. The partition could have been detached and reattached with a
narrower set of partition bounds. And if so, then we might route a
tuple to it that doesn't fit that narrower set of bounds, and there
will be no error, just database corruption.

I suppose one idea for handling this is to stick a counter into
pg_class or something that gets incremented every time a partition is
detached. At plan time, save the counter value; if it has changed at
execution time, ERROR. If not, then you have only added partitions to
worry about, and that's an easier problem.
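
Roughly, and with entirely hypothetical names (there is no such pg_class
column or API today), the counter scheme would look like this: every
DETACH bumps a per-parent counter, the plan remembers the value it saw,
and execution errors out on a mismatch; if the counter matches, only
newly added partitions are left to cope with.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct parent_rel
{
    uint64_t detach_counter;    /* imagined as a pg_class field */
};

static void
detach_partition(struct parent_rel *parent)
{
    parent->detach_counter++;   /* bumped as part of every DETACH */
}

static void
check_plan_still_valid(const struct parent_rel *parent, uint64_t seen_at_plan_time)
{
    if (parent->detach_counter != seen_at_plan_time)
    {
        fprintf(stderr, "ERROR: a partition was detached since planning\n");
        exit(1);
    }
    /* Counter unchanged: at worst, partitions were added, not removed. */
}

int
main(void)
{
    struct parent_rel parent = {0};
    uint64_t    seen = parent.detach_counter;   /* captured at plan time */

    detach_partition(&parent);                  /* concurrent DETACH */
    check_plan_still_valid(&parent, seen);      /* errors out */
    return 0;
}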

But I kind of wonder whether we're really gaining as much as you think
by trying to support concurrent DETACH in the first place. If there
are queries running against the table, there's probably at least
AccessShareLock on the partition itself, not just the parent. And
that means that reducing the necessary lock on the parent from
AccessExclusiveLock to something less doesn't really help that much,
because we're still going to block trying to acquire
AccessExclusiveLock on the child. Now you might say: OK, well, just
reduce that lock level, too.
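
(Just to make the blocking point concrete before going on -- this is a
reduced two-mode model, not PostgreSQL's lock manager, which has eight
modes and a full conflict table: ACCESS SHARE on the child, which any
scan of the partition takes, already conflicts with the ACCESS EXCLUSIVE
that DETACH needs on that child.)

#include <stdbool.h>
#include <stdio.h>

enum lockmode { ACCESS_SHARE, ACCESS_EXCLUSIVE };   /* two-mode slice only */

/* ACCESS EXCLUSIVE conflicts with everything, including ACCESS SHARE. */
static bool
conflicts(enum lockmode held, enum lockmode wanted)
{
    return held == ACCESS_EXCLUSIVE || wanted == ACCESS_EXCLUSIVE;
}

int
main(void)
{
    /* A running query holds ACCESS SHARE on the child partition, so a
     * DETACH wanting ACCESS EXCLUSIVE on that child has to wait, no
     * matter how weak the lock on the parent has become. */
    printf("detach blocks behind the query: %s\n",
           conflicts(ACCESS_SHARE, ACCESS_EXCLUSIVE) ? "yes" : "no");
    return 0;
}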

But that seems to me to be opening a giant can of worms which we are
unlikely to get resolved in time for this release. The worst of those
problems is that if you also reduce the lock level on the partition
when attaching it, then you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint. Those two things have to be synchronized somehow. You
could avoid that by reducing the lock level on the partition when
detaching and not when attaching. But even then, detaching a
partition can involve performing a whole bunch of operations for which
we currently require AccessExclusiveLock. AlterTableGetLockLevel says:

            /*
             * Removing constraints can affect SELECTs that have been
             * optimised assuming the constraint holds true.
             */
        case AT_DropConstraint: /* as DROP INDEX */
        case AT_DropNotNull:    /* may change some SQL plans */
            cmd_lockmode = AccessExclusiveLock;
            break;

Dropping a partition implicitly involves dropping a constraint. We
could gamble that the above has no consequences that are really
serious enough to care about, and that none of the other subsidiary
objects that we adjust during detach (indexes, foreign keys, triggers)
really need AccessExclusiveLock now, and that none of the other kinds
of subsidiary objects that we might need to adjust in the future
during a detach will be changes that require AccessExclusiveLock
either, but that sounds awfully risky to me. We have very little DDL
that runs with less than AccessExclusiveLock, and I've already found
lots of subtle problems that have to be patched up just for the
particular case of allowing attach/detach to take a lesser lock on the
parent table, and I bet that there are a whole bunch more similar
problems when you start talking about weakening the lock on the child
table, and I'm not convinced that there are any reasonable solutions
to some of those problems, let alone that we can come up with good
solutions to all of them in the very near future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
