From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Sergei Kornilov <sk(at)zsrv(dot)org>, Amit Langote <langote_amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> > OK. I'll post what I have by the end of the week.
> Thanks, Robert.
OK, so I got slightly delayed here by utterly destroying my laptop,
but I've mostly reconstructed what I did. I think there are some
remaining problems, but this seems like a good time to share what I've
got so far. I'm attaching three patches.
0001 is one which I posted before. It attempts to fix up
RelationBuildPartitionDesc() so that this function will always return
a partition descriptor based on a consistent snapshot of the catalogs.
Without this, I think there's nothing to prevent a commit which
happens while the function is running from causing the function to
fail or produce nonsense answers.
0002 introduces the concept of a partition directory. The idea is
that the planner will create a partition directory, and so will the
executor, and all calls which occur in those places to
RelationGetPartitionDesc() will instead call
PartitionDirectoryLookup(). Every lookup for the same relation in the
same partition directory is guaranteed to produce the same answer. I
believe this patch still has a number of weaknesses. More on that below.
0003 actually lowers the lock level. The comment here might need some work.
Here is a list of possible or definite problems that are known to me:
- I think we need a way to make partition directory lookups consistent
across backends in the case of parallel query. I believe this can be
done with a dshash and some serialization and deserialization logic,
but I haven't attempted that yet.
- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock. There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked. Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does. It also loses the infinite-loop
protection which find_all_inheritors() has. Not sure what to do about
that.
- In order for the new PartitionDirectory machinery to avoid
use-after-free bugs, we have to either copy the PartitionDesc out of
the relcache into the partition directory or avoid freeing it while it
is still in use. Copying it seems unappealing for performance
reasons, so I took the latter approach. However, what I did here in
terms of reclaiming memory is just about the least aggressive strategy
short of leaking it altogether - it just keeps it around until the
next rebuild that occurs while the relcache entry is not in use. We
might want to do better, e.g. freeing old copies immediately as soon
as the relcache reference count drops to 0. I just did it this way
because it was simple to code.
- I tried this with Alvaro's isolation tests and it fails some tests.
That's because Alvaro's tests expect that the list of accessible
partitions is based on the query snapshot. For reasons I attempted to
explain in the comments in 0003, I think the idea that we can choose
the set of accessible partitions based on the query snapshot is very
wrong. For example, suppose transaction 1 begins, reads an unrelated
table to establish a snapshot, and then goes idle. Then transaction 2
comes along, detaches a partition from an important table, and then
does crazy stuff with that table -- changes the column list, drops it,
reattaches it with different bounds, whatever. Then it commits. If
transaction 1 now comes along and uses the query snapshot to decide
that the detached partition ought to still be seen as a partition of
that partitioned table, disaster will ensue.
- I don't have any tests here, but I think it would be good to add
some, perhaps modeled on Alvaro's, and also some that involve multiple
ATTACH and DETACH operations mixed with other interesting kinds of
DDL. I also didn't make any attempt to update the documentation,
which is another thing that will probably have to be done at some
point. Not sure how much documentation we have of any of this now.
- I am uncertain whether the logic that builds up the partition
constraint is really safe in the face of concurrent DDL. I kinda
suspect there are some problems there, but maybe not. Once you hold
ANY lock on a partition, the partition constraint can't concurrently
become stricter, because no ATTACH can be done without
AccessExclusiveLock on the partition being attached; but it could
concurrently become less strict, because you only need a lesser lock
for a detach. Not sure if/how that could foul up this logic.
- I have not done anything about the fact that index detach takes
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company