Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-08-20 20:21:22
Message-ID: 20180820202122.xpcm4j5bntze4kg6@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2018-Aug-13, Robert Haas wrote:

> I think this is a somewhat confused analysis. We don't use
> SnapshotAny for catalog scans, and we never have. We used to use
> SnapshotNow, and we now use a current MVCC snapshot. What you're
> talking about, I think, is possibly using the transaction snapshot
> rather than a current MVCC snapshot for the catalog scans.
>
> I've thought about similar things, but I think there's a pretty deep
> can of worms. For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll be attempt to access
> a relfilenode that's not there any more. Similarly, if you use an
> older snapshot to build a partition descriptor, you might thing that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

I wonder if this all stems from a misunderstanding of what I suggested
to David offlist. My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not. So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry. But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.

I think that works fine for ATTACH without any further changes. I'm not
so sure about DETACH, particularly when snapshots persist for a "long
time" (a repeatable-read transaction). ISTM that in the above design,
the partition descriptor would lose the entry for the detached partition
ahead of time, which means queries would silently fail to see their data
(though they wouldn't crash). I first thought this could be fixed by
waiting for those snapshots to finish, but then I realized that there's
no actual place where waiting achieves anything. Certainly it's not
useful to wait before commit (because other snapshots are going to be
starting all the time), and it's not useful to start after the commit
(because by then the catalog tuple is already gone). Maybe we need two
transactions: mark partition as removed with an xmax of sorts, commit,
wait for all snapshots, start transaction, remove partition catalog
tuple, commit.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2018-08-20 20:57:00 Re: Improve behavior of concurrent ANALYZE/VACUUM
Previous Message David G. Johnston 2018-08-20 17:40:13 Re: Getting NOT NULL constraint from pg_attribute