Re: ON CONFLICT DO UPDATE for partitioned tables

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: ON CONFLICT DO UPDATE for partitioned tables
Date: 2018-03-22 09:31:59
Message-ID: d50382eb-79ba-66ba-4382-28858426fc99@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Fujita-san, Pavan,

Thank you both for reviewing. Replying to both emails here.

On 2018/03/20 20:53, Etsuro Fujita wrote:
> Here are comments on executor changes in (the latest version of) the patch:
>
> @@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate,
>              ItemPointerData conflictTid;
>              bool        specConflict;
>              List       *arbiterIndexes;
> +            PartitionTupleRouting *proute =
> +                                        mtstate->mt_partition_tuple_routing;
>
> -            arbiterIndexes = node->arbiterIndexes;
> +            /* Use the appropriate list of arbiter indexes. */
> +            if (mtstate->mt_partition_tuple_routing != NULL)
> +            {
> +                Assert(partition_index >= 0 && proute != NULL);
> +                arbiterIndexes =
> +                        proute->partition_arbiter_indexes[partition_index];
> +            }
> +            else
> +                arbiterIndexes = node->arbiterIndexes;
>
> To handle both cases the same way, I wonder if it would be better to have
> the arbiterindexes list in ResultRelInfo as well, as mentioned by Alvaro
> upthread, or to re-add mt_arbiterindexes as before and set it to
> proute->partition_arbiter_indexes[partition_index] before we get here,
> maybe in ExecPrepareTupleRouting, in the case of tuple routing.

It's a good idea. I somehow missed that Alvaro had already mentioned it.

In HEAD, we now have ri_onConflictSetProj and ri_onConflictSetWhere. I
propose we name the field ri_onConflictArbiterIndexes as done in the
updated patch.

>  ExecOnConflictUpdate(ModifyTableState *mtstate,
>                       ResultRelInfo *resultRelInfo,
> +                     TupleDesc onConflictSetTupdesc,
>                       ItemPointer conflictTid,
>                       TupleTableSlot *planSlot,
>                       TupleTableSlot *excludedSlot,
> @@ -1419,6 +1459,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>      ExecCheckHeapTupleVisible(estate, &tuple, buffer);
>
>      /* Store target's existing tuple in the state's dedicated slot */
> +    ExecSetSlotDescriptor(mtstate->mt_existing, RelationGetDescr(relation));
>      ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);
>
>      /*
> @@ -1462,6 +1503,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>      }
>
>      /* Project the new tuple version */
> +    ExecSetSlotDescriptor(mtstate->mt_conflproj, onConflictSetTupdesc);
>      ExecProject(resultRelInfo->ri_onConflictSetProj);
>
> Can we do ExecSetSlotDescriptor for mtstate->mt_existing and
> mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple
> routing?  That would make the API changes to ExecOnConflictUpdate
> unnecessary.

That's a good idea too, so done.

>
> @@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState
> *estate, int eflags)
>          econtext = mtstate->ps.ps_ExprContext;
>          relationDesc = resultRelInfo->ri_RelationDesc->rd_att;
>
> -        /* initialize slot for the existing tuple */
> -        mtstate->mt_existing =
> -            ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc);
> +        /*
> +         * Initialize slot for the existing tuple.  We determine which
> +         * tupleDesc to use for this after we have determined which relation
> +         * the insert/update will be applied to, possibly after performing
> +         * tuple routing.
> +         */
> +        mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state,
> NULL);
>
>          /* carried forward solely for the benefit of explain */
>          mtstate->mt_excludedtlist = node->exclRelTlist;
> @@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState
> *estate, int eflags)
>          /* create target slot for UPDATE SET projection */
>          tupDesc = ExecTypeFromTL((List *) node->onConflictSet,
>                                   relationDesc->tdhasoid);
> +        PinTupleDesc(tupDesc);
> +        mtstate->mt_conflproj_tupdesc = tupDesc;
> +
> +        /*
> +         * Just like the "existing tuple" slot, we'll defer deciding which
> +         * tupleDesc to use for this slot to a point where tuple routing has
> +         * been performed.
> +         */
>          mtstate->mt_conflproj =
> -            ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc);
> +            ExecInitExtraTupleSlot(mtstate->ps.state, NULL);
>
> If we do ExecInitExtraTupleSlot for mtstate->mt_existing and
> mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple
> routing, as said above, we wouldn't need this changes.  I think doing that
> only in the case of tuple routing and keeping this as-is would be better
> because that would save cycles in the normal case.

Hmm, I think we shouldn't be doing ExecInitExtraTupleSlot in
ExecPrepareTupleRouting, because we shouldn't have more than one instance
of mtstate->mt_existing and mtstate->mt_conflproj slots.

As you also said above, I think you meant to say here that we do
ExecInitExtraTupleSlot only once for both mtstate->mt_existing and
mtstate->mt_conflproj in ExecInitModifyTable and only do
ExecSetSlotDescriptor in ExecPrepareTupleRouting. I have changed it so
that ExecInitModifyTable now both creates the slot and sets the descriptor
for non-tuple-routing cases and only creates but doesn't set the
descriptor in the tuple-routing case.

For ExecPrepareTupleRouting to be able to access the tupDesc of the
onConflictSet target list, I've added ri_onConflictSetProjTupDesc which is
set by ExecInitPartitionInfo on first call for a give partition. This is
also suggested by Pavan in his review.

Considering all of that, both mt_conflproj_tupdesc and
partition_conflproj_tdescs (the array in PartitionTupleRouting) no longer
exist in the patch. And since arbiterIndexes has been moved into
ResultRelInfo too, partition_arbiter_indexes (the array in
PartitionTupleRouting) is gone too.

On 2018/03/22 13:34, Pavan Deolasee wrote:
> Thanks for writing a new version. A few comments:
>
>
> <listitem>
> <para>
> - Using the <literal>ON CONFLICT</literal> clause with partitioned
> tables
> - will cause an error if the conflict target is specified (see
> - <xref linkend="sql-on-conflict" /> for more details on how the
> clause
> - works). Therefore, it is not possible to specify
> - <literal>DO UPDATE</literal> as the alternative action, because
> - specifying the conflict target is mandatory in that case. On the
> other
> - hand, specifying <literal>DO NOTHING</literal> as the alternative
> action
> - works fine provided the conflict target is not specified. In that
> case,
> - unique constraints (or exclusion constraints) of the individual leaf
> - partitions are considered.
> - </para>
> - </listitem>
>
> We should document it somewhere that partition key update is not supported
> by
> ON CONFLICT DO UPDATE

Agreed. I have added a line on INSERT reference page to mention this
limitation.

> /*
> * get_partition_parent
> + * Obtain direct parent or topmost ancestor of given relation
> *
> - * Returns inheritance parent of a partition by scanning pg_inherits
> + * Returns direct inheritance parent of a partition by scanning
> pg_inherits;
> + * or, if 'getroot' is true, the topmost parent in the inheritance
> hierarchy.
> *
> * Note: Because this function assumes that the relation whose OID is
> passed
> * as an argument will have precisely one parent, it should only be called
> * when it is known that the relation is a partition.
> */
>
> Given that most callers only look for immediate parent, I wonder if it makes
> sense to have a new function, get_partition_root(), instead of changing
> signature of the current function. That will reduce foot-print of this patch
> quite a lot.

It seems like a good idea, so done that way.

> @@ -36,6 +38,7 @@ static char
> *ExecBuildSlotPartitionKeyDescription(Relation rel,
> Datum *values,
> bool *isnull,
> int maxfieldlen);
> +static List *adjust_onconflictset_tlist(List *tlist, TupleConversionMap
> *map);
>
> We should name this function in a more generic way, given that it's going
> to be
> used for other things too. What about adjust_partition_tlist?

I think that makes sense. We were trying to use adjust_inherited_tlist in
the earlier versions of this patch, so adjust_partition_tlist sounds like
a good name for this piece of code.

> +
> + /*
> + * If partition's tuple descriptor differs from the root parent,
> + * we need to adjust the onConflictSet target list to account for
> + * differences in attribute numbers.
> + */
> + if (map != NULL)
> + {
> + /*
> + * First convert Vars to contain partition's atttribute
> + * numbers.
> + */
> +
> + /* Convert Vars referencing EXCLUDED pseudo-relation. */
> + onconflset = map_partition_varattnos(node->onConflictSet,
> + INNER_VAR,
> + partrel,
> + firstResultRel, NULL);
>
> Are we not modifying node->onConflictSet in place? Or does
> map_partition_varattnos() create a fresh copy before scribbling on the
> input?
> If it's former then I guess that's a problem. If it's latter then we ought
> to
> have comments explaining that.

A copy is made before scribbling. Clarified that in the nearby comment.

> + tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid);
> + PinTupleDesc(tupDesc);
>
> Why do we need to pin the descriptor? If we do need, why don't we need
> corresponding ReleaseTupDesc() call?

PinTupleDesc was added in the patch as Alvaro had submitted it upthread,
which it wasn't clear to me either why it was needed. Looking at it
closely, it seems we can get rid of it if for the lack of corresponding
ReleaseTupleDesc(). ExecSetSlotDescriptor() seems to take care of pinning
and releasing tuple descriptors that are passed to it. If some
partition's tupDesc remains pinned because it was the last one that was
passed to it, the final ExecResetTupleTable will take care of releasing it.

I have removed the instances of PinTupleDesc in the updated patch, but I'm
not sure why the loose PinTupleDesc() in the previous version of the patch
didn't cause reference leak warnings or some such.

> I am still confused if the partition_conflproj_tdescs really belongs to
> PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for
> the
> MERGE patch, I moved everything to a new struct and made it part of the
> ResultRelInfo. If no re-mapping is necessary, we can just point to the
> struct
> in the root relation's ResultRelInfo. Otherwise create and populate a new
> one
> in the partition relation's ResultRelInfo.
>
> + leaf_part_rri->ri_onConflictSetWhere =
> + ExecInitQual(onconflwhere, &mtstate->ps);
> + }
>
> So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
> ResultRelInfo. Why not move mt_conflproj_tupdesc,
> partition_conflproj_tdescs,
> partition_arbiter_indexes etc to the ResultRelInfo as well?

I have moved both the projection tupdesc and the arbiter indexes list into
ResultRelInfo as I wrote above.

> +
> +/*
> + * Adjust the targetlist entries of an inherited ON CONFLICT DO UPDATE
> + * operation for a given partition
> + *
>
> As I said above, we should disassociate this function from ON CONFLICT DO
> UPDATE and rather have it as a general purpose facility.

OK, have fixed the comment and the name as mentioned above.

> + * The expressions have already been fixed, but we have to make sure that
> the
> + * target resnos match the partition. In some cases, this can force us to
> + * re-order the tlist to preserve resno ordering.
> + *
>
> Can we have some explanation regarding how the targetlist is reordered? I
> know
> the function does that by updating the resno in place, but some explanation
> would help. Also, should we add an assertion-build check to confirm that the
> resultant list is actually ordered?

OK, added a comment and also the assertion-build check on the order of
entries.

>
> @@ -66,7 +67,8 @@ static TupleTableSlot
> *ExecPrepareTupleRouting(ModifyTableState *mtstate,
> EState *estate,
> PartitionTupleRouting *proute,
> ResultRelInfo *targetRelInfo,
> - TupleTableSlot *slot);
> + TupleTableSlot *slot,
> + int *partition_index);
> static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
> static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
> static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
> @@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate,
> TupleTableSlot *slot,
> TupleTableSlot *planSlot,
> EState *estate,
> + int partition_index,
> bool canSetTag)
> {
> HeapTuple tuple;
>
> If we move the list of arbiter indexes and the tuple desc to ResultRelInfo,
> as
> suggested above, I think we can avoid making any API changes to
> ExecPrepareTupleRouting and ExecInsert.

Those API changes are no longer part of the patch.

Attached please find an updated version.

Thanks,
Amit

Attachment Content-Type Size
v8-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patch text/plain 35.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2018-03-22 09:32:13 Re: pg_stat_statements HLD for futur developments
Previous Message John Naylor 2018-03-22 09:15:05 Re: WIP: a way forward on bootstrap data