Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-08-13 16:00:34
Message-ID: CA+Tgmoa_uyR0j2VEkggAn7uRnQz64WiTSGSwzQX+N5d8K9KhhQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Aug 12, 2018 at 9:05 AM, David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> This is not a fully baked idea, but I'm wondering whether a better way
> to do this, instead of having this PartitionIsValid macro to determine
> whether the partition should be visible to the current transaction ID,
> would be to send along, when we invalidate a relcache entry, the
> transaction ID that it's invalid from. Other backends, when they
> process the invalidation message, could wipe out the cache entry only
> if their
> xid is >= the invalidation's "xmax" value. Otherwise, just tag the
> xmax onto the cache somewhere and always check it before using the
> cache (perhaps make it part of the RelationIsValid macro).
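
If I'm following, the idea amounts to something roughly like this
(rd_invalid_xmax and RelationIsValidForXid are made-up names here,
purely to restate the proposal; nothing of the sort exists today):

    /* hypothetical extra field in RelationData */
    TransactionId rd_invalid_xmax;  /* xid carried by the inval message,
                                     * or InvalidTransactionId if none */

    /* hypothetical check, used wherever we'd otherwise trust the entry */
    #define RelationIsValidForXid(relation, xid) \
        (!TransactionIdIsValid((relation)->rd_invalid_xmax) || \
         TransactionIdPrecedes((xid), (relation)->rd_invalid_xmax))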

Transactions don't necessarily commit in XID order, so this might be
an optimization to keep older transactions from having to do
unnecessary rebuilds -- which I actually doubt is a major problem, but
maybe I'm wrong -- but you can't rely solely on this as a way of
deciding which transactions will see the effects of some change. If
transactions 100, 101, and 102 begin in that order, and transaction
101 commits, there's no principled justification for 102 seeing its
effects but 100 not seeing them.

> This would
> also require that we move away from SnapshotAny type catalogue scans
> in favour of MVCC scans so that backends populating their relcache
> build it based on their current xid.

I think this is a somewhat confused analysis. We don't use
SnapshotAny for catalog scans, and we never have. We used to use
SnapshotNow, and we now use a current MVCC snapshot. What you're
talking about, I think, is possibly using the transaction snapshot
rather than a current MVCC snapshot for the catalog scans.
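
Just to make concrete what the current arrangement looks like, and
where your idea would plug in: the catalog scans that feed the relcache
are shaped roughly like this (condensed from the general pattern of
find_inheritance_children(); locking and error handling omitted):

    Relation    catrel;
    SysScanDesc scan;
    ScanKeyData key;
    HeapTuple   tuple;

    catrel = heap_open(InheritsRelationId, AccessShareLock);

    ScanKeyInit(&key,
                Anum_pg_inherits_inhparent,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(parentrelid));   /* OID of the parent rel */

    /*
     * Passing NULL for the snapshot means "use a current MVCC catalog
     * snapshot".  What you're describing would amount to passing the
     * result of GetTransactionSnapshot() here instead.
     */
    scan = systable_beginscan(catrel, InheritsParentIndexId, true,
                              NULL, 1, &key);

    while (HeapTupleIsValid(tuple = systable_getnext(scan)))
    {
        Form_pg_inherits form = (Form_pg_inherits) GETSTRUCT(tuple);

        /* form->inhrelid is one child (partition) of parentrelid */
    }

    systable_endscan(scan);
    heap_close(catrel, AccessShareLock);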

I've thought about similar things, but I think there's a pretty deep
can of worms. For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll be attempting to access
a relfilenode that's not there any more. Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.

It seems to me that overall you're not really focusing on the right
set of issues here. I think the very first thing we need to worry
about is how we're going to keep the executor from following a bad
pointer and crashing. Any time the partition descriptor changes, the
next relcache rebuild is going to replace rd_partdesc and free the old
one, but the executor may still have the old pointer cached in a
structure or local variable; the next attempt to dereference it will
be looking at freed memory, and kaboom. Right now, we prevent this by
not allowing the partition descriptor to be modified while there are
any queries running against the partition, so while there may be a
rebuild, the old pointer will remain valid (cf. keep_partdesc). I
think that whatever scheme you/we choose here should be tested with a
combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
-- one of them doing DDL on the table while the other runs a long
query.
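
To spell out the hazard (this is an illustration, not real executor
code; route_or_scan_partition() is a stand-in):

    PartitionDesc partdesc;
    int           i;

    partdesc = RelationGetPartitionDesc(rel);  /* points into the relcache entry */

    /*
     * ... the query runs for a while; a concurrent ATTACH/DETACH sends an
     * invalidation; the next relcache rebuild frees the old rd_partdesc ...
     */

    for (i = 0; i < partdesc->nparts; i++)     /* use after free: kaboom */
        route_or_scan_partition(partdesc->oids[i]);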

Once we keep it from blowing up, the second question is what the
semantics are supposed to be. It seems inconceivable to me that the
set of partitions that an in-progress query will scan can really be
changed on the fly. I think we're going to have to rule that if you
add or remove partitions while a query is running, we're going to scan
exactly the set we had planned to scan at the beginning of the query;
anything else would require on-the-fly plan surgery to a degree that
seems unrealistic. That means that when a new partition is attached,
already-running queries aren't going to scan it. If they did, we'd
have big problems, because the transaction snapshot might see rows in
those tables from an earlier time period during which that table
wasn't attached. There's no guarantee that the data at that time
conformed to the partition constraint, so it would be pretty
problematic to let users see it. Conversely, when a partition is
detached, there may still be scans from existing queries hitting it
for a fairly arbitrary length of time afterwards. That may be
surprising from a locking point of view or something, but it's correct
as far as MVCC goes. Any changes made after the DETACH operation
can't be visible to the snapshot being used for the scan.

Now, what we could try to change on the fly is the set of partitions
that are used for tuple routing. For instance, suppose we're
inserting a long stream of COPY data. At some point, we attach a new
partition from another session. If we then encounter a row that
doesn't route to any of the partitions that existed at the time the
query started, we could - instead of immediately failing - go and
reload the set of partitions that are available for tuple routing and
see if the new partition which was concurrently added happens to be
appropriate to the tuple we've got. If so, we could route the tuple
to it. But all of this looks optional. If new partitions aren't
available for insert/update tuple routing until the start of the next
query, that's not a catastrophe. The reverse direction might be more
problematic: if a partition is detached, I'm not sure how sensible it
is to keep routing tuples into it. On the flip side, what would
break, really?
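
In sketch form, for the insert/COPY path (hypothetical only: neither
ExecFindPartitionNoError() nor ExecRebuildPartitionTupleRouting() exists
today, and the real ExecFindPartition() simply raises an error when no
partition matches):

    partidx = ExecFindPartitionNoError(proute, slot, estate);
    if (partidx < 0)
    {
        /*
         * No partition known at query start accepts this row.  Before
         * failing, refresh the routing information and see whether a
         * concurrently attached partition does.
         */
        ExecRebuildPartitionTupleRouting(mtstate, proute);
        partidx = ExecFindPartitionNoError(proute, slot, estate);
        if (partidx < 0)
            ereport(ERROR,
                    (errcode(ERRCODE_CHECK_VIOLATION),
                     errmsg("no partition of relation \"%s\" found for row",
                            RelationGetRelationName(rootrel))));
    }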

Given the foregoing, I don't see why you need something like
PartitionIsValid() at all, or why you need an algorithm similar to
CREATE INDEX CONCURRENTLY. The problem seems mostly different. In
the case of CREATE INDEX CONCURRENTLY, the issue is that any new
tuples that get inserted while the index creation is in progress need
to end up in the index, so you'd better not start building the index
on the existing tuples until everybody who might insert new tuples
knows about the index. I don't see that we have the same kind of
problem in this case. Each partition is basically a separate table
with its own set of indexes; as long as queries don't end up with one
notion of which tables are relevant and a different notion of which
indexes are relevant, we shouldn't end up with any table/index
inconsistencies. And it's not clear to me what other problems we
actually have here. To put it another way, if we've got the notion of
"isvalid" for a partition, what's the difference between a partition
that exists but is not yet valid and one that exists and is valid? I
can't think of anything, and I have a feeling that you may therefore
be inventing a lot of infrastructure that is not necessary.

I'm inclined to think that we could drop the name CONCURRENTLY from
this feature altogether and recast it as work to reduce the lock level
associated with partition attach/detach. As long as we have a
reasonable definition of what the semantics are for already-running
queries, and clear documentation to go with those semantics, that
seems fine. If a particular user finds the concurrent behavior too
strange, they can always perform the DDL in a transaction that uses
LOCK TABLE first, removing the concurrency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
