Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-08-21 06:00:28
Message-ID: CAKJS1f8WTyn+8q+0DHHCDDhrahNz7ckxz5Zxj8XMFi=_kq+yVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21 August 2018 at 13:59, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
>> I wonder if this all stems from a misunderstanding of what I suggested
>> to David offlist. My suggestion was that the catalog scans would
>> continue to use the catalog MVCC snapshot, and that the relcache entries
>> would contain all the partitions that appear to the catalog; but each
>> partition's entry would carry the Xid of the creating transaction in a
>> field (say xpart), and that field is compared to the regular transaction
>> snapshot: if xpart is visible to the transaction snapshot, then the
>> partition is visible, otherwise not. So you never try to access a
>> partition that doesn't exist, because those just don't appear at all in
>> the relcache entry. But if you have an old transaction running with an
>> old snapshot, and the partitioned table just acquired a new partition,
>> then whether the partition will be returned as part of the partition
>> descriptor or not depends on the visibility of its entry.
>
> Hmm. One question is where you're going to get the XID of the
> creating transaction. If it's taken from the pg_class row or the
> pg_inherits row or something of that sort, then you risk getting a
> bogus value if something updates that row other than what you expect
> -- and the consequences of that are pretty bad here; for this to work
> as you intend, you need an exactly-correct value, not newer or older.
> An alternative is to add an xid field that stores the value
> explicitly, and that might work, but you'll have to arrange for that
> value to be frozen at the appropriate time.
>
> A further problem is that there could be multiple changes in quick
> succession. Suppose that a partition is attached, then detached
> before the attach operation is all-visible, then reattached, perhaps
> with different partition bounds.

I should probably post the WIP I have here. In those, I do have the
xmin array in the PartitionDesc. This gets taken from the new
pg_partition table, which I don't think suffers from the same issue as
taking it from pg_class, since nothing else will update the
pg_partition record.

However, I don't think the xmin array is going to work if we include
it in the PartitionDesc. The problem is, as I discovered from writing
the code was that the PartitionDesc must remain exactly the same
between planning an execution. If there are any more or any fewer
partitions found during execution than what we saw in planning then
run-time pruning will access the wrong element in the
PartitionPruneInfo array, or perhaps access of the end of the array.
It might be possible to work around that by identifying partitions by
Oid rather than PartitionDesc array index, but the run-time pruning
code is already pretty complex. I think coding it to work when the
PartitionDesc does not match between planning and execution is just
going to too difficult to get right. Tom is already unhappy with the
complexity of ExecFindInitialMatchingSubPlans().

I think the solution will require that the PartitionDesc does not:

a) Change between planning and execution.
b) Change during a snapshot after the partitioned table has been locked.

With b, it sounds like we'll need to take the most recent
PartitionDesc even if the transaction is older than the one that did
the ATTACH/DETACH operation as if we use an old version then, as
Robert mentions, there's nothing to stop another transaction making
changes to the table that make it an incompatible partition, e.g DROP
COLUMN. This wouldn't be possible if we update the PartitionDesc
right after taking the first lock on the partitioned table since any
transactions doing DROP COLUMN would be blocked until the other
snapshot gets released.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-08-21 06:22:50 Re: NLS handling fixes.
Previous Message Andrew Gierth 2018-08-21 05:47:47 Re: A really subtle lexer bug