Re: making update/delete of inheritance trees scale better

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making update/delete of inheritance trees scale better
Date: 2020-05-11 12:58:15
Message-ID: CAExHW5ts3cs1JLgSzGRwO0F6yvR1Ts6qj0n8PnO6UoV8EPkTpg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 8, 2020 at 7:03 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
>
> Here is a sketch for implementing the design that Tom described here:
> https://www.postgresql.org/message-id/flat/357.1550612935%40sss.pgh.pa.us
>
> In short, we would like to have only one plan from which ModifyTable
> gets the tuples to update/delete, not N plans for N child result
> relations as is done currently.
>
> I suppose things are the way they are because creating a separate plan
> for each result relation makes the ModifyTable node's job very simple,
> which is currently this:
>
> 1. Take the plan's output tuple, extract the tupleid of the tuple to
> update/delete in the currently active result relation,
> 2. If delete, go to 3, else if update, filter out the junk columns
> from the above tuple
> 3. Call ExecUpdate()/ExecDelete() on the result relation with the new
> tuple, if any
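
For reference, that per-row work amounts to roughly the condensed
sketch below (not the verbatim nodeModifyTable.c code; apply_update()
and apply_delete() are stand-ins for the static ExecUpdate() and
ExecDelete(), whose full signatures are elided):

static void
process_one_row(ModifyTableState *mtstate, TupleTableSlot *planSlot)
{
    /* per-subplan ResultRelInfo of the currently active result rel */
    ResultRelInfo *resultRelInfo =
        mtstate->resultRelInfo + mtstate->mt_whichplan;
    JunkFilter *junkfilter = resultRelInfo->ri_junkFilter;
    Datum       datum;
    bool        isNull;
    ItemPointer tupleid;

    /* 1. Pull the target tuple's ctid out of the junk column. */
    datum = ExecGetJunkAttribute(planSlot, junkfilter->jf_junkAttNo,
                                 &isNull);
    Assert(!isNull);
    tupleid = (ItemPointer) DatumGetPointer(datum);

    if (mtstate->operation == CMD_DELETE)
        apply_delete(resultRelInfo, tupleid);       /* steps 2-3 */
    else
    {
        /*
         * 2. Strip the junk columns; the plan already computed every
         * user column because its targetlist was expanded.
         */
        TupleTableSlot *slot = ExecFilterJunk(junkfilter, planSlot);

        /* 3. Store the fully-formed new tuple. */
        apply_update(resultRelInfo, tupleid, slot);
    }
}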
>
> If we make ModifyTable do a little more work for the inheritance case,
> we can create only one plan but without "expanding" the targetlist.
> That is, it will contain entries only for attributes that are assigned
> values in the SET clause. This makes the plan reusable across result
> relations, because all child relations must have those attributes,
> even though the attribute numbers might be different. Anyway, the
> work that ModifyTable will now have to do is this:
>
> 1. Take the plan's output tuple, extract tupleid of the tuple to
> update/delete and "tableoid"
> 2. Select the result relation to operate on using the tableoid
> 3. If delete, go to 4, else if update, fetch the tuple identified by
> tupleid from the result relation and fill in the unassigned columns
> using that "old" tuple, also filtering out the junk columns
> 4. Call ExecUpdate()/ExecDelete() on the result relation with the new
> tuple, if any
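
Under the proposed scheme, the per-row work might look like the
following sketch. mt_ctidAttNo and mt_tableoidAttNo are made-up
ModifyTableState fields holding the junk attribute numbers, and
lookup_result_rel(), fetch_old_tuple() and form_new_tuple() are
hypothetical helpers, two of which are sketched further down in this
mail:

static void
process_one_row_new(ModifyTableState *mtstate, TupleTableSlot *planSlot)
{
    Datum       datum;
    bool        isNull;
    ItemPointer tupleid;
    Oid         tableoid;
    ResultRelInfo *resultRelInfo;

    /* 1. Extract both junk columns: ctid and tableoid. */
    datum = ExecGetJunkAttribute(planSlot, mtstate->mt_ctidAttNo,
                                 &isNull);
    tupleid = (ItemPointer) DatumGetPointer(datum);
    datum = ExecGetJunkAttribute(planSlot, mtstate->mt_tableoidAttNo,
                                 &isNull);
    tableoid = DatumGetObjectId(datum);

    /* 2. Select the result relation to operate on. */
    resultRelInfo = lookup_result_rel(mtstate, tableoid);

    if (mtstate->operation == CMD_DELETE)
        apply_delete(resultRelInfo, tupleid);           /* step 4 */
    else
    {
        /*
         * 3. Fetch the old tuple and project the full new tuple,
         * filling columns not assigned in SET from the old one and
         * dropping the junk columns.
         */
        TupleTableSlot *oldslot = fetch_old_tuple(resultRelInfo, tupleid);
        TupleTableSlot *newslot = form_new_tuple(resultRelInfo,
                                                 planSlot, oldslot);

        apply_update(resultRelInfo, tupleid, newslot);  /* step 4 */
    }
}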
>
> I do think that doing this would be worthwhile even if we may be
> increasing ModifyTable's per-row overhead slightly, because the
> planning overhead of the current approach is very significant,
> especially for partition trees with more than a couple of thousand
> partitions. As to how bad the problem is, trying to create a generic
> plan for `update foo set ... where key = $1`, where foo has over 2000
> partitions, causes OOM even on a machine with 6GB of memory.

The per-row overhead would be incurred for every row, whereas the
plan-time overhead is paid once, and in the case of a prepared
statement is almost free. So we need to compare the two, especially
when there are 2000 partitions and all of them are being updated. But
generally I agree that this would be a better approach. It might also
help us use partitionwise join (PWJ) when the result relation joins
with another partitioned table; I am not sure whether that effectively
happens today via partition pruning. More on this later.

>
> The one plan shared by all result relations will be the same as the
> one we would get if the query were a SELECT, except it will contain
> junk attributes such as ctid, needed to identify tuples, and a new
> "tableoid" junk attribute if multiple result relations will be present
> due to inheritance. One major way in which this targetlist differs
> from that of the current per-result-relation plans is that it won't be
> passed through expand_targetlist(), because the set of unassigned
> attributes may not be the same among children. As mentioned above,
> those missing attributes will be filled in by ModifyTable doing some
> extra work, whereas previously they would have come with the plan's
> output tuple.
>
> For child result relations that are foreign tables, their FDW adds
> junk attribute(s) to the query's targetlist by updating it in place
> (AddForeignUpdateTargets). However, as the child tables will no
> longer get their own parsetree, we must use some hack around this
> interface to obtain the foreign-table-specific junk attributes and add
> them to the original/parent query's targetlist. Assuming that all or
> most of the children will belong to the same FDW, we will end up with
> only a handful of such junk columns in the final targetlist. I am not
> sure if it's worthwhile to change the API of AddForeignUpdateTargets
> to require FDWs to not scribble on the passed-in parsetree as part of
> this patch.
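
For reference, postgres_fdw's implementation of that callback is
approximately the following (paraphrased from postgres_fdw.c, so check
the source for the exact code); note how it scribbles on the passed-in
parsetree directly:

static void
postgresAddForeignUpdateTargets(Query *parsetree,
                                RangeTblEntry *target_rte,
                                Relation target_relation)
{
    Var        *var;
    TargetEntry *tle;

    /* Make a Var representing the ctid of the target tuple. */
    var = makeVar(parsetree->resultRelation,
                  SelfItemPointerAttributeNumber,
                  TIDOID, -1, InvalidOid, 0);

    /* Wrap it in a resjunk TLE with the right name ... */
    tle = makeTargetEntry((Expr *) var,
                          list_length(parsetree->targetList) + 1,
                          pstrdup("ctid"),
                          true);

    /* ... and add it to the query's targetlist in place. */
    parsetree->targetList = lappend(parsetree->targetList, tle);
}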

What happens if there's a mixture of foreign and local partitions, or
a mixture of FDWs? Injecting junk columns from all FDWs into the
top-level targetlist will cause an error, because those attributes
won't be available everywhere.

>
> As for how ModifyTable will create the new tuple for updates, I have
> decided to use a ProjectionInfo for each result relation, which
> projects a full, *clean* tuple ready to be put back into the relation.
> When projecting, the plan's output tuple serves as the OUTER tuple and
> the old tuple fetched to fill unassigned attributes serves as the SCAN
> tuple. By having this ProjectionInfo also serve as the "junk filter",
> we don't need JunkFilters. The targetlist that this projection
> computes is the same as that of the current result-relation-specific
> plans. Initially, I thought to generate this "expanded" targetlist in
> ExecInitModifyTable(). But as that can be somewhat expensive, doing it
> only once in the planner seemed like a good idea. These
> per-result-relation targetlists are carried in the ModifyTable node.
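
A minimal sketch of that projection step, assuming the ProjectionInfo
is hung off ResultRelInfo in a new field (ri_projectNew is a made-up
name):

static TupleTableSlot *
form_new_tuple(ResultRelInfo *resultRelInfo,
               TupleTableSlot *planSlot, TupleTableSlot *oldSlot)
{
    ProjectionInfo *projNew = resultRelInfo->ri_projectNew;
    ExprContext *econtext = projNew->pi_exprContext;

    /* Plan output supplies the SET-assigned columns as the OUTER tuple ... */
    econtext->ecxt_outertuple = planSlot;

    /* ... and the old tuple supplies the unassigned ones as the SCAN tuple. */
    econtext->ecxt_scantuple = oldSlot;

    /*
     * Junk columns are simply absent from the projected targetlist, so
     * no separate JunkFilter pass is needed.
     */
    return ExecProject(projNew);
}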
>
> To identify the result relation from the tuple produced by the plan,
> the "tableoid" junk column will be used. As the tuples for different
> result relations won't necessarily come out in the order in which the
> result relations are laid out in the ModifyTable node, we need a way
> to map a tableoid value to a result relation index. I have decided
> to use a hash table here.
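
A sketch of that mapping with a plain dynahash table, built once in
ExecInitModifyTable() (the struct and function names are made up):

typedef struct TargetRelLookup
{
    Oid         relationOid;    /* hash key: RelationGetRelid() of rel */
    int         relationIndex;  /* index into mtstate->resultRelInfo[] */
} TargetRelLookup;

static HTAB *
build_result_rel_hash(ModifyTableState *mtstate, int nrels)
{
    HASHCTL     ctl;
    HTAB       *htab;
    int         i;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(TargetRelLookup);
    ctl.hcxt = CurrentMemoryContext;
    htab = hash_create("ModifyTable target rel hash", nrels, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

    for (i = 0; i < nrels; i++)
    {
        Oid         reloid =
            RelationGetRelid(mtstate->resultRelInfo[i].ri_RelationDesc);
        bool        found;
        TargetRelLookup *entry;

        entry = hash_search(htab, &reloid, HASH_ENTER, &found);
        entry->relationIndex = i;
    }

    return htab;
}

lookup_result_rel() from the earlier sketch would then be a HASH_FIND
probe into this table.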

Can we plan the scan query to add a sort node to order the rows by tableoid?

>
> A couple of things that I haven't thought very hard about yet, but
> may revisit later.
>
> * We will no longer be able to use the DirectModify APIs to push
> updates to remote servers for foreign child result relations.

If we convert the whole DML into partitionwise DML (just as happens
today, albeit unintentionally), we should be able to use DirectModify;
PWJ will help there. But even otherwise, if we can detect that the
scan underlying a particular partition can be evaluated completely on
the same node where the partition resides, we should be able to use
DirectModify. If we can't support this optimization, the queries that
benefit from it today won't perform well. I think we need to think
about this now instead of leaving it for later. Otherwise, make it so
that we use the old way when there are foreign partitions and the new
way otherwise.

>
> * Over in [1], I have said that we get run-time pruning for free for
> ModifyTable because the plan we are using is the same as that for
> SELECT, although now I think I hadn't thought that through. With the
> PoC patch that I have:
>
> prepare q as update foo set a = 250001 where a = $1;
> set plan_cache_mode to 'force_generic_plan';
> explain execute q(1);
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Update on foo  (cost=0.00..142.20 rows=40 width=14)
>    Update on foo_1
>    Update on foo_2 foo
>    Update on foo_3 foo
>    Update on foo_4 foo
>    ->  Append  (cost=0.00..142.20 rows=40 width=14)
>          Subplans Removed: 3
>          ->  Seq Scan on foo_1  (cost=0.00..35.50 rows=10 width=14)
>                Filter: (a = $1)
> (9 rows)
>
> While it's true that we will never have to actually update foo_2,
> foo_3, and foo_4, ModifyTable still sets up ResultRelInfos for them,
> which ideally it shouldn't. Maybe we'll need to do something about
> that after all.

* Tuple re-routing during UPDATE. For now it's disabled, so your
design should work. But we shouldn't design this feature in such a way
that it gets in the way of enabling tuple re-routing in the future :).

--
Best Wishes,
Ashutosh Bapat
