Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-08-07 12:40:12
Message-ID: CAKJS1f8AdQXedNmznbTWr4VZjNPDSrJLjOyF_FJNV_e9sivJng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3 August 2018 at 01:25, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> 1. Do all the normal partition attach partition validation.
> 2. Insert a record into pg_partition with partisvalid=false
> 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
> 4. Obtain a session-level AccessExclusiveLock on the partition being attached.
> 5. Commit.
> 6. Start a new transaction.
> 7. Wait for snapshots older than our own to be released.
> 8. Mark the partition as valid
> 9. Invalidate relcache for the partitioned table.
> 10. release session-level locks.

So I was thinking about this again and realised this logic is broken.
All it takes is a snapshot that starts after the ATTACH PARTITION
started and before it completed. This snapshot will have the new
partition attached while it's possibly still open which could lead to
non-repeatable reads in a repeatable read transaction. The window for
this to occur is possibly quite large given that the ATTACH
CONCURRENTLY can wait a long time for older snapshots to finish.

Here's my updated thinking for an implementation which seems to get
around the above problem:

ATTACH PARTITION CONCURRENTLY:

1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.

#define PartitionIsValid(pd, i) (((pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) \
|| (!(pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))

DETACH PARTITION CONCURRENTLY:

1. Obtain ShareUpdateExclusiveLock on partition being detached
(instead of the AccessShareLock that non-concurrent detach uses)
2. Update the pg_partition record, set partvalid = false.
3. Commit
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held
when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.

DETACH PARTITION CONCURRENTLY failure when it fails after step 3 (above)

1. Make vacuum of a partition check for pg_partition.partvalid =
false, if xmin of tuple is old enough, perform a partition cleanup by
doing steps 6+7 above.

A VACUUM FREEZE must run before transaction wraparound, so this means
a partition can never reattach itself when the transaction counter
wraps.

I believe I've got the attach and detach working correctly now and
also isolation tests that appear to prove it works. I've also written
the failed detach cleanup code into vacuum. Unusually, since foreign
tables can also be partitions this required teaching auto-vacuum to
look at foreign tables, only in the sense of checking for failed
detached partitions. It also requires adding vacuum support for
foreign tables too. It feels a little bit weird to modify auto-vacuum
to look at foreign tables, but I really couldn't see another way to do
this.

I'm now considering if this all holds together in the event the
pg_partition tuple of an invalid partition becomes frozen. The problem
would be that PartitionIsValid() could return the wrong value due to
TransactionIdPrecedesOrEquals(GetCurrentTransactionId(),
(pd)->xmin[(i)]). this code is trying to keep the detached partition
visible to older snapshots, but if pd->xmin[i] becomes frozen, then
the partition would become invisible. However, I think this won't be
a problem since a VACUUM FREEZE would only freeze tuples that are also
old enough to have failed detaches cleaned up earlier in the vacuum
process.

Also, we must disallow a DEFAULT partition from being attached to a
partition with a failed DETACH CONCURRENTLY as it wouldn't be very
clear what the default partition's partition qual would be, as this is
built based on the quals of all attached partitions.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-08-07 12:47:51 Re: ATTACH/DETACH PARTITION CONCURRENTLY
Previous Message Konstantin Knizhnik 2018-08-07 12:36:16 Re: [HACKERS] Cached plans and statement generalization