Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, jeevan(dot)ladhe(at)enterprisedb(dot)com, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.
Date: 2018-04-09 11:42:41
Message-ID: CAGPqQf2anMcWCxWtnpa3RDCeZHgBctSW9ZBPTrg2CUfBLMGUwg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Added to the open items list.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

On Tue, Apr 3, 2018 at 2:52 PM, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
wrote:

> On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> > Hello.
> >
> > At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera <
> alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> >> Why do we need AccessExclusiveLock on all children of a relation that we
> >> want to scan to search for rows not satisfying the constraint? I think
> >> it should be enough to get ShareLock, which prevents INSERT, no? I have
> >> a feeling I'm missing something here, but I don't know what, and all
> >> tests pass with that change.
>
> Thinking on this a bit, I see no problem with locking the children with
> just ShareLock. It was just a paranoia that if we're going to lock the
> table itself being attached with AEL, we must its children (if any) with
> AEL, too.
>
> > Mmm. I'm not sure if there's a lock-upgrade case but the
> > following sentense is left at the last of one of the modified
> > comments. ATREwriteTables is called with AEL after that (that has
> > finally no effect in this case).
> >
> > | But we cannot risk a deadlock by
> taking
> > | * a weaker lock now and the stronger one only when needed.
> >
> > I don't actually find places where the children's lock can be
> > raised but this seems just following the lock parent first
> > principle.
>
> No lock upgrade happen as of now. The comment was added by the commit
> 972b6ec20bf [1], which removed the code that could cause such a deadlock.
> The comment fragment is simply trying to explain why we don't postpone the
> locking of children to a later time, say, to the point where we actually
> know that they need to be scanned. Previously the code next to the
> comment used to lock the children using AccessShareLock, because at that
> point we just needed to check if the table being attached is one of those
> children and then later locked with AEL if it turned out that they need to
> be scanned to check the partition constraint.
>
> > By the way check_default_allows_bound() (CREATE TABLE case)
> > contains a similar usage of find_all_inheritors(default_rel,
> > AEL).
>
> Good catch. Its job is more or less same as
> ValidatePartitionConstraints(), except all the children (if any) are
> scanned right away instead of queuing it like in the AT case.
>
> >> Also: the change proposed to remove the get_default_oid_from_partdesc()
> >> call actually fixes the bug, but to me it was not at all obvious why.
> >
> > CommandCounterIncrement updates the content of a relcache entry
> > via invalidation. It can be surprising for callers of a function
> > like StorePartitionBound.
> >
> > CommandCounterIncrement
> > <skip inval mechanism>
> > RelationCacheInvalidateEntry
> > RelationFlushRelation
> > RelationClearRelation
>
> Because of the CCI() after storing the default partition OID into the
> system catalog, RelationClearRelation() would changes what
> rel->rd_partdesc points to where 'rel' is the ATExecAttachPartition()'s
> reference to the relcache entry of the parent that it passed to
> StorePartitionBound.
>
> So, whereas the rel->rd_partdesc wouldn't contain the default partition
> before StorePartitionBound() was called, it would after.
>
> >> To figure out why, I first had to realize that
> >> ValidatePartitionConstraints was lying to me, both in name and in
> >> comments: it doesn't actually validate anything, it merely queues
> >> entries so that alter table's phase 3 would do the validation. I found
> >> this extremely confusing, so I fixed the comments to match reality, and
> >> later decided to rename the function also.
> >
> > It is reasonable. Removing exccessive extension of lower-level
> > partitions is also reasonable. The previous code extended
> > inheritances in different manner for levels at odd and even
> > depth.
>
> I like the new code including the naming, but I notice this changes the
> order in which we do the locking now. There are still sites in the code
> where the locking order is breadth-first, that is, as determined by
> find_all_inheritors(), as this function would too previously.
>
> Also note that beside the breadth-first -> depth-first change, this also
> changes the locking order of partitions for a given partitioned table.
> The OIDs in partdesc->oids[] are canonically ordered (that is order of
> their partition bounds), whereas find_inheritance_children() that's called
> by find_all_inheritors() would lock them in the order in which the
> individual OIDs were found in the system catalog.
>
> Not sure if there is anything to be alarmed of here, but in all previous
> discussions, this has been a thing to avoid.
>
> >> At that point I was able to understand what the problem was: when
> >> attaching the default partition, we were setting the constraint to be
> >> validated for that partition to the correctly computed partition
> >> constraint; and later in the same function we would set the constraint
> >> to "the immature constraint" because we were now seeing that the default
> >> partition OID was not invalid. So it was rather a bug in
> >> ValidatePartitionConstraints, in that it was accepting to set the
> >> expression to validate when another expression had already been set! I
> >> added an assert to protect against this. And then changed the decision
> >> of whether or not to run this block based on the attached partition
> >> being the default one or not; because as the "if" test was, it was just
> >> a recipe for confusion. (Now, if you look carefully you realize that
> >> the new test for bound->is_default I added is kinda redundant: it can
> >> only be set if there was a default partition OID at start of the
> >> function, and therefore we know we're not adding a default partition.
> >> And for the case where we *are* adding the default partition, it is
> >> still Invalid because we don't re-read its own OID. But relying on that
> >> seems brittle because it breaks if we happen to update the default
> >> partition OID in the middle of that function, which is what we were
> >> doing. Hence the new is_default test.)
> >
> > Seems reasonable.
>
> +1
>
> > But even if we assume so, rereading
> > defaultPartOid is still breaking the assumption that
> > defaultPartOid is that at the time of entering to this function
> > and the added condition just conceals the fact.
>
> Afaics, defaultPartOid is only set at the beginning of
> ATExecAttachPartition, so it seems fine.
>
> > So I think it should be an assertion.
> >
> > | if (OidIsValid(defaultPartOid))
> > | {
> > | /*
> > | * The command cannot be adding default partition if the
> > | * defaultPartOid is valid.
> > | */
> > | Assert(!cmd->bound->is_default);
>
> I guess that makes sense, because when trying to attach a default
> partition to the table that already has one, check_new_partition_bound
> that's called before this errors out before we could get here.
>
> >> I looked at the implementation of ValidatePartitionConstraints and
> >> didn't like it, so I rewrote it also.
> >>
> >> This comment is mistaken, too:
> >> - /*
> >> - * Skip if the partition is itself a partitioned table. We can
> only
> >> - * ever scan RELKIND_RELATION relations.
> >> - */
> >> ... because it ignores the possibility of a partition being a foreign
> >> table. The code seems to work, but I think there is no test to cover
> >> the case that a foreign table containing data that doesn't satisfy the
> >> constraint is attached, because when I set that case to return doing
> >> nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
> >> test failed.
> >
> > Foreign tables are intentionally not verified on attaching (in my
> > understanding). ATRewriteTables ignores foreign tables so it
> > contradicts to what ATRewriteTables thinks. As my understanding
> > foreign tables are assumed to be unrestrictable by local
> > constraint after attaching so the users are responsible to the
> > consistency.
>
> That and ATRewriteTable() in phase 3 cannot really cope with being handed
> a foreign table as it can only work with RELKIND_RELATION tables.
> Actually, the following in ATRewriteTables() also prevents passing foreign
> tables:
>
> static void
> ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE
> lockmode)
> {
> ListCell *ltab;
>
> /* Go through each table that needs to be checked or rewritten */
> foreach(ltab, *wqueue)
> {
> AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
>
> /*
> * Foreign tables have no storage, nor do partitioned tables and
> * indexes.
> */
> if (tab->relkind == RELKIND_FOREIGN_TABLE ||
> tab->relkind == RELKIND_PARTITIONED_TABLE ||
> tab->relkind == RELKIND_PARTITIONED_INDEX)
> continue;
>
> >> Generally speaking, I didn't like ATExecAttachPartition; it's doing too
> >> much that should have been done in ATPrepAttachPartition. The only
> >> reason that's not broken is because we don't allow ATTACH PARTITION to
> >> appear together with other commands in alter table, which seems
> >> disappointing ... for example, when attaching multiple tables and a
> >> default partition exist, you have to scan the default one multiple
> >> times, which is lame.
> >
> > Indeed, currently only partition commands are isolated from other
> > alter table subcommands (except all in tablespace cases). We can
> > improve that in the next step?
>
> I think we can improve this.
>
> >> (Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
> >> on the parent relation ... I wonder if this can be used to cause a
> >> deadlock during InitResultRelationInfo.)
>
> It does seem that there is a possibility of deadlock, because when
> InitResultRelInfo(), that's initializing a partition, calls
> RelationGetPartitionQual() to get its partition constraint, the partition
> would already have been locked. So this reverses the generally
> established order of locking the parent first; another session which tries
> to add a column to the parent, for example, will lock the parent and then
> partitions.
>
> I think we need for inserts directly into a partition to lock all of its
> ancestors from the root to the direct parent (in that order) before
> locking the partition itself, or maybe at least the parent if not all
> ancestors.
>
> > Mmm. That seems strange since the RealtionGetPartitionQual
> > doesn't return anything (specifically NIL) there, because we
> > don't allow a partition to be attached to partitioned tables. It
> > seems totally useless.
> >
> > Addition to that the code tries to add the partition qual (which
> > is always NIL) destructively and assign to partConstraint..
>
> It is not always NIL. Imagine attaching a partition at a lower level.
>
> create table foo (a int, b char) partition by list (a);
> create table foo1 partition of foo for values in (1) partition by list (b);
> create table foo1a (a, b) as values (2, 'b');
>
> -- note that we're attaching to foo1, not foo
> alter table foo1 attach partition foo1a for values in ('a');
>
> If we didn't include foo1's (the parent) constraint (that is, a = 1), the
> above command will wrongly succeed. It must include a = 1 in the
> constraint to be be checked when scanning foo1a.
>
> Although, I noticed there is no test covering this.
>
> >> BTW, I think this is already broken for the case where the default
> >> partition is partitioned and you attach a partition colliding with a row
> >> that's being concurrently inserted in a partition of the default
> >> partition, though I didn't bother to write a test for that.
> >
> > How is it broken? Every attaching partitions are checked for the
> > specified partition bound and every partitions of the default
> > partition are also checked against the new default part bound. We
> > already hold required locks on all the participants.
>
> Yes, concurrent insertions to either the default partition or any of its
> partitions couldn't be occurring as we'd have locked them.
>
> >> Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.
> >
> > It's reassuring. Thanks.
>
> Yes, thank you for taking the time out to clean things up.
>
> Thanks,
> Amit
>
> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=
> commitdiff;h=972b6ec20bf
>
>

--
Rushabh Lathia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2018-04-09 11:43:41 Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Previous Message Teodor Sigaev 2018-04-09 11:35:05 Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)