Re: [HACKERS] UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: [HACKERS] UPDATE of partition key
Date: 2018-01-01 16:13:14
Message-ID: CAJ3gD9f6znZrgfrC6OkAy6YxUQiGSRX-Pr9cOEe7fAOHt1CSvw@mail.gmail.com
Lists: pgsql-hackers

On 16 December 2017 at 03:09, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> started another review pass over the main patch, so here are
> some comments about that.

I have yet to address all of the comments, but in the meantime, here
are my responses to some specific points ...

> + if (!partrel)
> + {
> + /*
> + * We locked all the partitions above including the leaf
> + * partitions. Note that each of the newly opened relations in
> + * *partitions are eventually closed by the caller.
> + */
> + partrel = heap_open(leaf_oid, NoLock);
> + InitResultRelInfo(leaf_part_rri,
> + partrel,
> + resultRTindex,
> + rel,
> + estate->es_instrument);
> + }
>
> Hmm, isn't there a problem here? Before, we opened all the relations
> here and the caller closed them all. But now, we're only opening some
> of them. If the caller closes them all, then they will be closing
> some that we opened and some that we didn't. That seems quite bad,
> because the reference counts that are incremented and decremented by
> opening and closing should all end up at 0. Maybe I'm confused
> because it seems like this would break in any scenario where even 1
> relation was already opened and surely you must have tested that
> case... but if there's some reason this works, I don't know what it
> is, and the comment doesn't tell me.

In ExecCleanupTupleRouting(), we close only the newly opened
partitions; we skip the ones that are actually part of the UPDATE
result rels.
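
Roughly, what I mean is something like the below (only a sketch; the
exact ModifyTableState field names are partly my assumptions rather
than what the patch finally uses):

/*
 * Sketch: close only the leaf partition result rels that were newly
 * opened for tuple routing.  Leaf partitions that double as UPDATE
 * result rels are skipped; those are opened and closed elsewhere.
 */
for (i = 0; i < mtstate->mt_num_partitions; i++)
{
    ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];

    /* Is this leaf partition also one of the UPDATE result rels? */
    if (next_subplan_off < mtstate->mt_num_subplan_partition_offsets &&
        mtstate->mt_subplan_partition_offsets[next_subplan_off] == i)
    {
        next_subplan_off++;     /* yes; leave it for its owner to close */
        continue;
    }

    ExecCloseIndices(resultRelInfo);
    heap_close(resultRelInfo->ri_RelationDesc, NoLock);
}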

> + /*
> + * UPDATEs set the transition capture map only when a new subplan
> + * is chosen. But for INSERTs, it is set for each row. So after
> + * INSERT, we need to revert back to the map created for UPDATE;
> + * otherwise the next UPDATE will incorrectly use the one created
> + * for INESRT. So first save the one created for UPDATE.
> + */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> I wonder if there is some more elegant way to handle this problem.
> Basically, the issue is that ExecInsert() is stomping on
> mtstate->mt_transition_capture, and your solution is to save and
> restore the value you want to have there. But maybe we could instead
> find a way to get ExecInsert() not to stomp on that state in the first
> place. It seems like the ON CONFLICT stuff handled that by adding a
> second TransitionCaptureState pointer to ModifyTable, thus
> mt_transition_capture and mt_oc_transition_capture. By that
> precedent, we could add mt_utr_transition_capture or similar, and
> maybe that's the way to go. It seems a bit unsatisfying, but so does
> what you have now.

In the case of ON CONFLICT, if there are both INSERT and UPDATE
statement triggers referencing transition tables, the two triggers need
to populate their transition tables independently, hence the need for
two separate transition capture states: mt_transition_capture and
mt_oc_transition_capture. But in the case of update tuple routing, the
INSERT statement trigger does not come into the picture, so the same
mt_transition_capture can serve the purpose of populating the
transition tables with the OLD and NEW rows. I therefore think it would
be redundant, if not incorrect, to have a whole new transition capture
state for update tuple routing.

I will check whether it turns out better to have two tcs_map fields in
TransitionCaptureState, one for UPDATE and one for INSERT, but at first
glance that does not look good.
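
For reference, that alternative would look something like this
(illustrative only; the two field names are invented):

typedef struct TransitionCaptureState
{
    /* ... existing fields, including the current tcs_map ... */

    TupleConversionMap *tcs_update_map; /* set once per UPDATE subplan */
    TupleConversionMap *tcs_insert_map; /* set per routed row in ExecInsert() */
} TransitionCaptureState;

ExecInsert() would then write only tcs_insert_map and would no longer
stomp on the UPDATE's map, so the save/restore would go away; but it
duplicates state that, for update tuple routing, serves a single
purpose.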

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan(). So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile. It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, the code accessing
the map still needs to know whether the index it is using is a subplan
number or a leaf partition number:

If the access is by subplan number, we have to use
subplan_partition_offsets to convert it to the leaf partition index, so
the function tupconv_map_for_subplan() is needed for subplan-indexed
access anyway. The only thing that changes is that
tupconv_map_for_subplan() no longer has to check whether the map is
indexed by leaf partition or not. But that complexity is hidden in this
one function; the outside code need not worry about it.
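
To make that concrete, here is a simplified sketch of what
tupconv_map_for_subplan() has to do today (field names approximate):

static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    if (mtstate->mt_is_tupconv_perpart)
    {
        /* Maps are indexed by leaf partition; translate the subplan index. */
        int     leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

        return mtstate->mt_childparent_tupconv_maps[leaf_index];
    }

    /* Maps are indexed by subplan; use the index directly. */
    return mtstate->mt_childparent_tupconv_maps[whichplan];
}

With an always-per-leaf map, only the direct-indexing branch (and the
check) would go away.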

If the access is by leaf partition number, I think your worry is that
the map might actually have been indexed by subplan while the code
accesses it partition-wise. Currently we access the map by leaf
partition index only when setting up
mtstate->mt_*transition_capture->tcs_map during inserts, and at that
place there is an Assert(mtstate->mt_is_tupconv_perpart == true). Maybe
we can have another function, tupconv_map_for_partition(), rather than
directly accessing mt_childparent_tupconv_maps[], and put this Assert()
in that function. What do you say?
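
Something like this is what I have in mind (tupconv_map_for_partition()
does not exist yet; it is just a sketch):

static TupleConversionMap *
tupconv_map_for_partition(ModifyTableState *mtstate, int leaf_index)
{
    /* The maps must have been built per leaf partition to use this. */
    Assert(mtstate->mt_is_tupconv_perpart);

    return mtstate->mt_childparent_tupconv_maps[leaf_index];
}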

I am more inclined towards avoiding an always-leaf-partition-indexed
map, for the additional reasons mentioned below ...

> Would it be reasonable to just always do this by partition leaf
> index, even if we don't strictly need that set of mappings?

If there are no transition tables in the picture, we don't require
per-leaf child-parent conversion at all. Always indexing by leaf
partition would therefore mean that tuple conversion maps get set up
for all (say, 100) leaf partitions even if there are only a couple of
UPDATE subplans. I feel this would unnecessarily increase the startup
cost of an update-partition-key operation.
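
To illustrate the cost difference (variable names below are only for
illustration), per-subplan setup builds one map per UPDATE subplan:

for (i = 0; i < nplans; i++)
    per_subplan_maps[i] =
        convert_tuples_by_name(RelationGetDescr(subplan_rels[i]),
                               RelationGetDescr(rootrel),
                               gettext_noop("could not convert row type"));

whereas an always-per-leaf layout would do the equivalent for every
leaf partition, e.g. all 100 of them, even when only two or three of
them have UPDATE subplans:

for (i = 0; i < num_leaf_partitions; i++)
    per_leaf_maps[i] =
        convert_tuples_by_name(RelationGetDescr(leaf_rels[i]),
                               RelationGetDescr(rootrel),
                               gettext_noop("could not convert row type"));

Even when convert_tuples_by_name() ends up returning NULL because no
conversion is needed, it still has to compare the two tuple
descriptors, so the setup work is proportional to the number of leaf
partitions.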

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
