Re: making update/delete of inheritance trees scale better

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making update/delete of inheritance trees scale better
Date: 2020-05-11 14:41:30
Message-ID: CA+HiwqEijMeLXLzSP-++4sfPJ_4L8aELKgtbrGmBR31P1AsTCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Ashutosh,

Thanks for chiming in.

On Mon, May 11, 2020 at 9:58 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> On Fri, May 8, 2020 at 7:03 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > I do think that doing this would be worthwhile even if we may be
> > increasing ModifyTable's per-row overhead slightly, because planning
> > overhead of the current approach is very significant, especially for
> > partition trees with beyond a couple of thousand partitions. As to
> > how bad the problem is, trying to create a generic plan for `update
> > foo set ... where key = $1`, where foo has over 2000 partitions,
> > causes OOM even on a machine with 6GB of memory.
>
> Per row overhead would be incurred for every row whereas the plan time
> overhead is one-time or in case of a prepared statement almost free.
> So we need to compare it esp. when there are 2000 partitions and all
> of them are being updated.

I assume that such UPDATEs would be uncommon.

> But generally I agree that this would be a
> better approach. It might help using PWJ when the result relation
> joins with other partitioned table.

It does, because the plan below ModifyTable is same as if the query
were SELECT instead of UPDATE; with my PoC:

explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
foo.a = foo2.a;
QUERY PLAN
--------------------------------------------------
Update on foo
Update on foo_1
Update on foo_2
-> Append
-> Merge Join
Merge Cond: (foo_1.a = foo2_1.a)
-> Sort
Sort Key: foo_1.a
-> Seq Scan on foo_1
-> Sort
Sort Key: foo2_1.a
-> Seq Scan on foo_1 foo2_1
-> Merge Join
Merge Cond: (foo_2.a = foo2_2.a)
-> Sort
Sort Key: foo_2.a
-> Seq Scan on foo_2
-> Sort
Sort Key: foo2_2.a
-> Seq Scan on foo_2 foo2_2
(20 rows)

as opposed to what you get today:

explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
foo.a = foo2.a;
QUERY PLAN
--------------------------------------------------
Update on foo
Update on foo_1
Update on foo_2
-> Merge Join
Merge Cond: (foo_1.a = foo2.a)
-> Sort
Sort Key: foo_1.a
-> Seq Scan on foo_1
-> Sort
Sort Key: foo2.a
-> Append
-> Seq Scan on foo_1 foo2
-> Seq Scan on foo_2 foo2_1
-> Merge Join
Merge Cond: (foo_2.a = foo2.a)
-> Sort
Sort Key: foo_2.a
-> Seq Scan on foo_2
-> Sort
Sort Key: foo2.a
-> Append
-> Seq Scan on foo_1 foo2
-> Seq Scan on foo_2 foo2_1
(23 rows)

> > For child result relations that are foreign tables, their FDW adds
> > junk attribute(s) to the query’s targetlist by updating it in-place
> > (AddForeignUpdateTargets). However, as the child tables will no
> > longer get their own parsetree, we must use some hack around this
> > interface to obtain the foreign table specific junk attributes and add
> > them to the original/parent query’s targetlist. Assuming that all or
> > most of the children will belong to the same FDW, we will end up with
> > only a handful such junk columns in the final targetlist. I am not
> > sure if it's worthwhile to change the API of AddForeignUpdateTargets
> > to require FDWs to not scribble on the passed-in parsetree as part of
> > this patch.
>
> What happens if there's a mixture of foreign and local partitions or
> mixture of FDWs? Injecting junk columns from all FDWs in the top level
> target list will cause error because those attributes won't be
> available everywhere.

That is a good question and something I struggled with ever since I
started started thinking about implementing this.

For the problem that FDWs may inject junk columns that could neither
be present in local tables (root parent and other local children) nor
other FDWs, I couldn't think of any solution other than to restrict
what those junk columns can be -- to require them to be either "ctid",
"wholerow", or a set of only *inherited* user columns. I think that's
what Tom was getting at when he said the following in the email I
cited in my first email:

"...It gets a bit harder if the tree contains some foreign tables,
because they might have different concepts of row identity, but I'd
think in most cases you could still combine those into a small number
of output columns."

Maybe I misunderstood what Tom said, but I can't imagine how to let
these junk columns be anything that *all* tables contained in an
inheritance tree, especially the root parent, cannot emit, if they are
to be emitted out of a single plan.

> > As for how ModifyTable will create the new tuple for updates, I have
> > decided to use a ProjectionInfo for each result relation, which
> > projects a full, *clean* tuple ready to be put back into the relation.
> > When projecting, plan’s output tuple serves as OUTER tuple and the old
> > tuple fetched to fill unassigned attributes serves as SCAN tuple. By
> > having this ProjectionInfo also serve as the “junk filter”, we don't
> > need JunkFilters. The targetlist that this projection computes is
> > same as that of the result-relation-specific plan. Initially, I
> > thought to generate this "expanded" targetlist in
> > ExecInitModifyTable(). But as it can be somewhat expensive, doing it
> > only once in the planner seemed like a good idea. These
> > per-result-relations targetlists are carried in the ModifyTable node.
> >
> > To identify the result relation from the tuple produced by the plan,
> > “tableoid” junk column will be used. As the tuples for different
> > result relations won’t necessarily come out in the order in which
> > result relations are laid out in the ModifyTable node, we need a way
> > to map the tableoid value to result relation indexes. I have decided
> > to use a hash table here.
>
> Can we plan the scan query to add a sort node to order the rows by tableoid?

Hmm, I am afraid that some piece of partitioning code that assumes a
certain order of result relations, and that order is not based on
sorting tableoids.

> > A couple of things that I didn't think very hard what to do about now,
> > but may revisit later.
> >
> > * We will no longer be able use DirectModify APIs to push updates to
> > remote servers for foreign child result relations
>
> If we convert a whole DML into partitionwise DML (just as it happens
> today unintentionally), we should be able to use DirectModify. PWJ
> will help there. But even we can detect that the scan underlying a
> particular partition can be evaluated completely on the node same as
> where the partition resides, we should be able to use DirectModify.

I remember Fujita-san mentioned something like this, but I haven't
looked into how feasible it would be given the current DirectModify
interface.

> But if we are not able to support this optimization, the queries which
> benefit from it for today won't perform well. I think we need to think
> about this now instead of leave for later. Otherwise, make it so that
> we use the old way when there are foreign partitions and new way
> otherwise.

I would very much like find a solution for this, which hopefully isn't
to fall back to using the old way.

> * Tuple re-routing during UPDATE. For now it's disabled so your design
> should work. But we shouldn't design this feature in such a way that
> it comes in the way to enable tuple re-routing in future :).

Sorry, what is tuple re-routing and why does this new approach get in its way?

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-05-11 16:25:54 Re: 2pc leaks fds
Previous Message Tom Lane 2020-05-11 14:40:23 Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)