ATTACH/DETACH PARTITION CONCURRENTLY

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-08-02 13:25:02
Message-ID: CAKJS1f9QjUwQrio20Pi=yCHmnouf4z3SfN8sqXaAcwREG6k0zQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

One of the downsides of declarative partitioning vs old school
inheritance partitioning is that a new partition cannot be added to
the partitioned table without taking an AccessExclusiveLock on the
partitioned table. We've obviously got a bunch of features for
various other things where we work a bit harder to get around that
problem, e.g creating indexes concurrently.

I've started working on allowing partitions to be attached and
detached with just a ShareUpdateExclusiveLock on the table. If I'm
correct, then we can do this in a similar, but more simple way as to
how CREATE INDEX CONCURRENTLY works. We just need to pencil in that
the new partition exists, but not yet valid, then wait for snapshots
older than our own to finish before marking the partition is valid.

One problem I had with doing this is that there was not really a good
place to store that "isvalid" flag for partitions. We have pg_index
for indexes, but partition details are just spread over pg_inherits
and pg_class. So step 1 was to move all that into a new table called
pg_partition. I think this is quite nice as it also gets rid of
relpartbound out of pg_class. It probably just a matter of time before
someone complains that they can't create some partition with some
pretty large Datum due to it not being able to fit on a single heap
page (pg_class has no TOAST table). I ended up getting rid of
pg_class.relispartition replacing it with relpartitionparernt which is
just InvalidOid when the table or index is not a partition. This
allows various pieces of code to be more efficient since we can look
at the relcache instead of scanning pg_inherits all the time. It's now
also much faster to get a partitions ancestors.

So, patches 0001 is just one I've already submitted for the July
'fest. Nothing new. It was just required to start this work.

0002 migrates partitions out of pg_inherits into pg_partition. This
patch is at a stage where it appears to work, but is very unpolished
and requires me to stare at it much longer than I've done so far.
There's a bunch of code that gets repeated way too many times in
tablecmds.c, for example.

0003 does the same for partitioned indexes. The patch is in a similar,
maybe slightly worse state than 0002. Various comments will be out of
date.

0004
is the early workings of what I have in mind for the concurrent ATTACH
code. It's vastly incomplete. It does pass make check but really only
because there are no tests doing any concurrent attaches. There's a
mountain of code missing that ignores invalid partitions. I just have
a very simple case working. Partition-wise joins will be very much
broken by what I have so far, and likely a whole bunch of other stuff.

About the extent of my tests so far are the following:

--setup
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 (a int);
insert into listp1 values(1);
insert into listp2 values(2);

-- example 1.
start transaction isolation level repeatable read; -- Session 1
select * from listp; -- Session 1
a
---
1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 (waits for release of session 1's snapshot)
select * from listp; -- Session 1
a
---
1

commit; -- session 1 (session 2's alter table now finishes waiting)
select * from listp; -- Session 1 (new partition now valid)
a
---
1
2
(2 rows)

-- example 2.
start transaction isolation level read committed; -- session 1;
select * from listp; -- session 1
a
---
1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 completes without waiting.

select * from listp; -- Session 1 (new partition visible while in transaction)
a
---
1
2
(2 rows)

This basically works by:

1. Do all the normal partition attach partition validation.
2. Insert a record into pg_partition with partisvalid=false
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released.
8. Mark the partition as valid
9. Invalidate relcache for the partitioned table.
10. release session-level locks.

I've disallowed the feature when the partitioned table has a default
partition. I don't see how this can be made to work.

At the moment ALTER TABLE ... ATTACH PARTITION commands cannot contain
any other sub-commands in the ALTER TABLE, so performing the
additional transaction commit and begin inside the single sub-command
might be okay. It does mean that 'rel' which is passed down to
ATExecAttachPartition() must be closed and reopened again which
results in the calling function having a pointer into a closed
Relation. I worked around this by changing the code so it passes a
pointer to the Relation, and I've got ATExecAttachPartition() updating
that pointer before returning. It's not particularly pretty, but I
didn't really see how else this can be done.

I've not yet done anything about the DETACH CONCURRENTLY case. I think
it should just be the same steps in some roughly reverse order. We
can skip the waiting part of the partition being detached is still
marked as invalid from some failed concurrent ATTACH.

I've not thought much about pg_dump beyond just have it ignore invalid
partitions. I don't think it's very useful to support some command
that attaches an invalid partition since there will be no command to
revalidate an invalid partition. It's probably best to resolve that
with a DETACH followed by a new ATTACH. So probably pg_dump can just
do nothing for invalid partitions.

So anyway, my intentions of posting this patch now rather than when
it's closer to being finished is for design review. I'm interested in
hearing objections, comments, constructive criticism for patches
0002-0004. Patch 0001 comments can go to [1]

Are there any blockers on this that I've overlooked?

[1] https://www.postgresql.org/message-id/CAKJS1f81TpxZ8twugrWCo%3DVDHEkmagxRx7a%2B1z4aaMeQy%3DnA7w%40mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch application/octet-stream 56.7 KB
v1-0002-Store-partition-details-in-pg_partition-instead-o.patch application/octet-stream 101.6 KB
v1-0003-Don-t-store-partition-index-details-in-pg_inherit.patch application/octet-stream 49.0 KB
v1-0004-Allow-partitions-to-be-attached-without-blocking-.patch application/octet-stream 36.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shay Rojansky 2018-08-02 14:10:05 Re: Stored procedures and out parameters
Previous Message Robert Haas 2018-08-02 13:18:57 Re: Should contrib modules install .h files?