Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date: 2018-11-14 19:27:31
Message-ID: CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 9, 2018 at 9:50 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I had another idea, too. I think we might be able to reuse the
> technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
> That is:
>
> - make a note of SharedInvalidMessageCounter before doing any of the
> relevant catalog lookups
> - do them
> - AcceptInvalidationMessages()
> - if SharedInvalidMessageCounter has changed, discard all the data we
> collected and retry from the top
>
> I believe that is sufficient to guarantee that whatever we construct
> will have a consistent view of the catalogs which is the most recent
> available view as of the time we do the work. And with this approach
> I believe we can continue to use syscache lookups to get the data
> rather than having to use actual index scans, which is nice.

Here are a couple of patches to illustrate this approach to this part
of the overall problem. 0001 is, I think, a good cleanup that may as
well be applied in isolation; it makes the code in
RelationBuildPartitionDesc both cleaner and more efficient. 0002
adjusts things so that - I hope - the partition bounds we get for the
individual partitions are as of the same point in the commit
sequence as the list of children. As I noted before, Alvaro's patch
doesn't seem to have tackled this part of the problem.
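
For reference, the shape of that retry loop is roughly this (a sketch
only; the catalog lookups themselves are elided, and 0002 may differ
in detail):

/*
 * Retry until we manage to read the catalogs without any concurrent
 * invalidation arriving, per the technique from commit 4240e429.
 */
for (;;)
{
    uint64      inval_count = SharedInvalidMessageCounter;

    /* ... perform all the syscache lookups and collect results ... */

    AcceptInvalidationMessages();
    if (inval_count == SharedInvalidMessageCounter)
        break;          /* nothing changed under us; results consistent */

    /* else discard everything we built and loop back to the top */
}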

Another part of the problem is finding a way to make sure that if we
execute a query (or plan one), all parts of the executor (or planner)
see the same PartitionDesc for every relation. In the case of
parallel query, I think it's important to try to get consistency not
only within a single backend but also across backends. I'm thinking
about perhaps creating an object with a name like
PartitionDescDirectory which can optionally attach to dynamic shared
memory. It would store an OID -> PartitionDesc mapping in local
memory and, optionally, an additional OID -> serialized-PartitionDesc
mapping in DSA. When given an OID, it would check the local hash table
first and then, if that finds nothing, check the shared hash table if
there is one. If an entry is found there, we'd deserialize it and add
it to the local hash table. We'd then hang such a directory off of the EState
for the executor and the PlannerInfo for the planner. As compared
with Alvaro's proposal, this approach has the advantage of not
treating parallel query as a second-class citizen, and also of keeping
partitioning considerations out of the snapshot handling, which as I
said before seems to me to be a good idea.
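
In rough outline, such a directory might look like this (all names
here are invented for illustration; the dshash lookup and the
serialized form are hand-waved):

typedef struct PartitionDescDirectoryEntry
{
    Oid             relid;      /* hash key */
    PartitionDesc   pd;         /* cached descriptor for this relation */
} PartitionDescDirectoryEntry;

typedef struct PartitionDescDirectory
{
    HTAB           *local_map;  /* OID -> PartitionDesc, backend-local */
    dshash_table   *shared_map; /* OID -> serialized desc; NULL if none */
} PartitionDescDirectory;

static PartitionDesc
PartitionDescDirectoryLookup(PartitionDescDirectory *dir, Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    bool        found;
    PartitionDescDirectoryEntry *entry;

    entry = hash_search(dir->local_map, &relid, HASH_ENTER, &found);
    if (found)
        return entry->pd;

    if (dir->shared_map != NULL)
    {
        /* If another backend already published a serialized copy for
         * this OID, deserialize it into entry->pd here.  (Elided.) */
    }

    /* Fall back to the relcache and remember the result, so that all
     * later lookups through this directory see the same descriptor. */
    entry->pd = RelationGetPartitionDesc(rel);
    return entry->pd;
}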

One thing which was vaguely on my mind in earlier emails but which I
think I can now articulate somewhat more clearly is this: In some
cases, a consistent but outdated view of the catalog state is just as
bad as an inconsistent view of the catalog state. For example, it's
not OK to decide that a tuple can be placed in a certain partition
based on an outdated list of relation constraints, including the
partitioning constraint - nor is it OK to decide that a partition can
be pruned based on an old view of the partitioning constraint. I
think this means that whenever we change a partition's partitioning
constraint, we MUST take AccessExclusiveLock on the partition.
Otherwise, a heap_insert() [or a partition pruning decision] can be in
progress on that relation in one backend at the same time that some
other backend is changing that partition's constraint, which can't
possibly work. The best we can reasonably do is to reduce the locking
level on the partitioned table itself.
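
Concretely, that rule implies asymmetric locking along these lines
(illustrative only; the reduced lock level on the parent is a
placeholder, not a concrete proposal):

/* ATTACH/DETACH would take a weaker lock on the partitioned table,
 * but still AccessExclusiveLock on the partition whose constraint
 * is changing. */
Relation    parent;
Relation    part;

parent = heap_open(parentRelid, ShareUpdateExclusiveLock);
part = heap_open(partRelid, AccessExclusiveLock);

/* ... update relpartbound / pg_inherits, send invalidations ... */

heap_close(part, NoLock);       /* hold both locks until commit */
heap_close(parent, NoLock);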

A corollary is that holding the PartitionDescs that a particular query
sees for a particular relation fixed, whether by the method Alvaro
proposes, by what I am proposing here, or by some other method, is not
a panacea. For example, the ExecPartitionCheck call in copy.c
sometimes gets skipped on the theory that if tuple routing has sent us
to partition X, then the tuple being routed must satisfy the partition
constraint for that partition. But that's not true if we set up tuple
routing using one view of the catalogs, and then things changed
afterwards. RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked those children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.
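
For context, that fetch is just a syscache read, taken with no lock on
the child (simplified from the actual code):

/* Read childOid's relpartbound from pg_class via the syscache. */
HeapTuple   tuple;
Datum       datum;
bool        isnull;
PartitionBoundSpec *boundspec = NULL;

tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOid));
if (!HeapTupleIsValid(tuple))
    elog(ERROR, "cache lookup failed for relation %u", childOid);
datum = SysCacheGetAttr(RELOID, tuple,
                        Anum_pg_class_relpartbound, &isnull);
if (!isnull)
    boundspec = castNode(PartitionBoundSpec,
                         stringToNode(TextDatumGetCString(datum)));
ReleaseSysCache(tuple);

Nothing stops a concurrent ATTACH/DETACH from committing between this
read and the moment we finally lock the child.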

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch application/octet-stream 6.1 KB
0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch application/octet-stream 5.7 KB
