Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Sergei Kornilov <sk(at)zsrv(dot)org>, Amit Langote <langote_amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2019-01-29 05:28:57
Message-ID: CANP8+jK2nTRXLvzdn9KZ3sz+5tU19qDpVxMwxn=n6EZYC4+-MA@mail.gmail.com
Lists: pgsql-hackers

On Mon, 28 Jan 2019 at 20:15, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
> wrote:
> > Right, the planner/executor "disconnect" is one of the challenges, and
> > why I was trying to keep the old copy of the PartitionDesc around
> > instead of building updated ones as needed.
>
> I agree that would be simpler, but I just don't see how to make it
> work. For one thing, keeping the old copy around can't work in
> parallel workers, which never had a copy in the first place. For two
> things, we don't have a really good mechanism to keep the
> PartitionDesc that was used at plan time around until execution time.
> Keeping the relation open would do it, but I'm pretty sure that causes
> other problems; the system doesn't expect any residual references.
>
> I know you had a solution to this problem, but I don't see how it can
> work. You said "Snapshots have their own cache (hash table) of
> partition descriptors. If a partdesc is requested and the snapshot has
> already obtained that partdesc, the original one is returned -- we
> don't request a new one from partcache." But presumably this means
> when the last snapshot is unregistered, the cache is flushed
> (otherwise, when?) and if that's so then this relies on the snapshot
> that was used for planning still being around at execution time, which
> I am pretty sure isn't guaranteed.
>
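
For concreteness, here is a rough sketch of the per-snapshot cache being
described; the struct and function names below are invented, not taken
from the actual patch:

/*
 * Sketch only: a cache keyed by relation OID that lives as long as the
 * snapshot does and hands back the same PartitionDesc on repeat requests.
 */
typedef struct SnapPartdescEntry
{
    Oid             reloid;     /* hash key */
    PartitionDesc   partdesc;   /* descriptor first obtained under this snapshot */
} SnapPartdescEntry;

static PartitionDesc
snapshot_get_partdesc(HTAB **snap_cache, Relation rel)
{
    Oid                 reloid = RelationGetRelid(rel);
    SnapPartdescEntry  *entry;
    bool                found;

    if (*snap_cache == NULL)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(SnapPartdescEntry);
        *snap_cache = hash_create("snapshot partdesc cache", 16, &ctl,
                                  HASH_ELEM | HASH_BLOBS);
    }

    entry = (SnapPartdescEntry *) hash_search(*snap_cache, &reloid,
                                              HASH_ENTER, &found);
    if (!found)
        entry->partdesc = RelationGetPartitionDesc(rel);    /* from partcache */

    return entry->partdesc;
}

The unanswered question above is exactly when *snap_cache gets destroyed,
i.e. when the cache is flushed.
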
> Also, and I think this point is worthy of some further discussion, the
> thing that really seems broken to me about your design is the idea
> that it's OK to use the query or transaction snapshot to decide which
> partitions exist. The problem with that is that some query or
> transaction with an old snapshot might see as a partition some table
> that has been dropped or radically altered - different column types,
> attached to some other table now, attached to same table but with
> different bounds, or just dropped. And therefore it might try to
> insert data into that table and fail in all kinds of crazy ways, about
> the mildest of which is inserting data that doesn't match the current
> partition constraint. I'm willing to be told that I've misunderstood
> the way it all works and this isn't really a problem for some reason,
> but my current belief is that not only is it a problem with your
> design, but that it's such a bad problem that there's really no way to
> fix it and we have to abandon your entire approach and go a different
> route. If you don't think that's true, then perhaps we should discuss
> it further.
>
> > > I suppose the reason for this is so
> > > that we don't have to go to the expense of copying the partition
> > > bounds from the PartitionDesc into the final plan, but it somehow
> > > seems a little scary to me. Perhaps I am too easily frightened, but
> > > it's certainly a problem from the point of view of this project, which
> > > wants to let the PartitionDesc change concurrently.
> >
> > Well, my definition of the problem started with the assumption that we
> > would keep the partition array indexes unchanged, so "change
> > concurrently" is what we needed to avoid. Yes, I realize that you've
> > opted to change that definition.
>
> I don't think I made a conscious decision to change this, and I'm kind
> of wondering whether I have missed some better approach here. I feel
> like the direction I'm pursuing is an inevitable consequence of having
> no good way to keep the PartitionDesc around from plan-time to
> execution-time, which in turn feels like an inevitable consequence of
> the two points I made above: there's no guarantee that the plan-time
> snapshot is still registered anywhere by the time we get to execution
> time, and even if there were, the associated PartitionDesc may point
> to tables that have been drastically modified or don't exist any more.
> But it's possible that my chain-of-inevitable-consequences has a weak
> link, in which case I would surely like it if you (or someone else)
> would point it out to me.
>
> > I may have forgotten some of your earlier emails on this, but one aspect
> > (possibly a key one) is that I'm not sure we really need to cope, other
> > than with an ERROR, with queries that continue to run across an
> > attach/detach -- moreso in absurd scenarios such as the ones you
> > described where the detached table is later re-attached, possibly to a
> > different partitioned table. I mean, if we can just detect the case and
> > raise an error, and this let us make it all work reasonably, that might
> > be better.
>
> Well, that's an interesting idea. I assumed that users would hate
> that kind of behavior with a fiery passion that could never be
> quenched. If not, then the problem changes from *coping* with
> concurrent changes to *detecting* concurrent changes, which may be
> easier, but see below.
>
> > I think detaching partitions concurrently is a necessary part of this
> > feature, so I would prefer not to go with a solution that works for
> > attaching partitions but not for detaching them. That said, I don't see
> > why it's impossible to adjust the partition maps in both cases. But I
> > don't have anything better than hand-waving ATM.
>
> The general problem here goes back to what I wrote in the third
> paragraph of this email: a PartitionDesc that was built with a
> particular snapshot can't be assumed to be usable after any subsequent
> DDL has occurred that might affect the shape of the PartitionDesc.
> For example, if somebody detaches a partition and reattaches it with
> different partition bounds, and we use the old PartitionDesc for tuple
> routing, we'll route tuples into that partition that do not satisfy
> its partition constraint. And we won't even get an ERROR, because the
> system assumes that any tuple which arrives at a partition as a result
> of tuple routing must necessarily satisfy the partition constraint.
>
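
To make that assumption concrete, a simplified sketch in pseudo-C (every
name below is invented; this is not the real executor code):

/*
 * Pseudo-C sketch, invented names.  The point is only that no partition
 * constraint re-check happens on the routed path.
 */
static void
route_and_insert(PartitionDesc partdesc, TupleTableSlot *slot)
{
    /* Pick a leaf partition using bounds from a possibly-stale partdesc. */
    Relation    leaf = choose_leaf_partition(partdesc, slot);

    /*
     * A routed tuple is assumed to satisfy the leaf's partition constraint,
     * so it is not re-checked here.  If the leaf was detached and then
     * re-attached with different bounds after partdesc was built, the row
     * goes in anyway: no ERROR, just a row that violates the current
     * partition constraint.
     */
    insert_into_leaf(leaf, slot);
}
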
> If only concurrent CREATE/ATTACH operations are allowed and
> DROP/DETACH is not, then that kind of thing isn't possible. Any new
> partitions which have shown up since the plan was created can just be
> ignored, and the old ones must still have the same partition bounds
> that they did before, and everything is fine. Or, if we're OK with a
> less-nice solution, we could just ERROR out when the number of
> partitions have changed. Some people will get errors they don't like,
> but they won't end up with rows in their partitions that violate the
> constraints.
>
> But as soon as you allow concurrent DETACH, then things get really
> crazy. Even if, at execution time, there are the same number of
> partitions as I had at plan time, and even if those partitions have
> the same OIDs as what I had at plan time, and even if those OIDs are
> in the same order in the PartitionDesc, it does not prove that things
> are OK. The partition could have been detached and reattached with a
> narrower set of partition bounds. And if so, then we might route a
> tuple to it that doesn't fit that narrower set of bounds, and there
> will be no error, just database corruption.
>
> I suppose one idea for handling this is to stick a counter into
> pg_class or something that gets incremented every time a partition is
> detached. At plan time, save the counter value; if it has changed at
> execution time, ERROR. If not, then you have only added partitions to
> worry about, and that's an easier problem.
>

Yes, a version number would solve that issue.
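
As a minimal sketch of what that check could look like (the counter
storage and the helper below are invented; the counter could live in
pg_class as you suggest, or elsewhere):

/*
 * Sketch only -- get_partition_detach_version() and the saved counter are
 * invented names.  The planner records the counter, the executor re-checks.
 */
typedef struct PartdescVersionCheck
{
    Oid     parent_relid;       /* partitioned table the plan was built against */
    uint32  detach_version;     /* counter value seen at plan time */
} PartdescVersionCheck;

static void
check_no_concurrent_detach(PartdescVersionCheck *check)
{
    uint32  current = get_partition_detach_version(check->parent_relid);

    if (current != check->detach_version)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("partitions were detached concurrently with this query")));

    /*
     * Otherwise only ATTACH can have happened since planning, and partitions
     * added since then can simply be ignored by this plan.
     */
}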

> But I kind of wonder whether we're really gaining as much as you think
> by trying to support concurrent DETACH in the first place. If there
> are queries running against the table, there's probably at least
> AccessShareLock on the partition itself, not just the parent. And
> that means that reducing the necessary lock on the parent from
> AccessExclusiveLock to something less doesn't really help that much,
> because we're still going to block trying to acquire
> AccessExclusiveLock on the child. Now you might say: OK, well, just
> reduce that lock level, too.
>

The whole point of CONCURRENT detach is that you're not removing the
partition whilst people are still using it; you're just marking it for
later disuse.
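
Purely as a sketch of that shape (every function below is invented, and
how the waiting is done is hand-waved):

/*
 * Invented names throughout -- "mark for disuse now, remove later".
 */
static void
detach_partition_concurrently(Oid parent_relid, Oid part_relid)
{
    /*
     * Phase 1: flag the partition as detach-pending so that newly built
     * PartitionDescs stop including it, and commit so others see the flag.
     */
    mark_partition_detach_pending(parent_relid, part_relid);
    CommitTransactionCommand();
    StartTransactionCommand();

    /*
     * Phase 2: wait out queries whose plans were built against a
     * PartitionDesc that still contains the partition, so that nothing can
     * route new rows into it.
     */
    wait_for_plans_still_seeing_partition(parent_relid, part_relid);

    /*
     * Phase 3: remove the inheritance link and the partition constraint,
     * now that nobody reaches the partition through the parent any more.
     */
    complete_partition_detach(parent_relid, part_relid);
}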

> But that seems to me to be opening a giant can of worms which we are
> unlikely to get resolved in time for this release. The worst of those
> problem is that if you also reduce the lock level on the partition
> when attaching it, then you are adding a constraint while somebody
> might at the exact same time be inserting a tuple that violates that
> constraint.

Spurious.

This would only be true if we were adding a constraint that affected
existing partitions.

The constraint being added affects the newly added partition, not existing
ones.

> Those two things have to be synchronized somehow. You
> could avoid that by reducing the lock level on the partition when
> detaching and not when attaching. But even then, detaching a
> partition can involve performing a whole bunch of operations for which
> we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
>
>     /*
>      * Removing constraints can affect SELECTs that have been
>      * optimised assuming the constraint holds true.
>      */
>     case AT_DropConstraint:     /* as DROP INDEX */
>     case AT_DropNotNull:        /* may change some SQL plans */
>         cmd_lockmode = AccessExclusiveLock;
>         break;
>
> Dropping a partition implicitly involves dropping a constraint. We
> could gamble that the above has no consequences that are really
>

It's not a gamble if you know that the constraints being dropped constrain
only the object being dropped.

> serious enough to care about, and that none of the other subsidiary
> objects that we adjust during detach (indexes, foreign keys, triggers)
> really need AccessExclusiveLock now, and that none of the other kinds
> of subsidiary objects that we might need to adjust in the future
> during a detach will be changes that require AccessExclusiveLock
> either, but that sounds awfully risky to me. We have very little DDL
> that runs with less than AccessExclusiveLock, and I've already found
> lots of subtle problems that have to be patched up just for the
> particular case of allowing attach/detach to take a lesser lock on the
> parent table, and I bet that there are a whole bunch more similar
> problems when you start talking about weakening the lock on the child
> table, and I'm not convinced that there are any reasonable solutions
> to some of those problems, let alone that we can come up with good
> solutions to all of them in the very near future.
>

I've not read every argument on this thread, but many of the later points
made here are spurious, by which I mean they sound like they could apply
but in fact do not.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
