Re: Optimization for updating foreign tables in Postgres FDW

From: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Shigeru Hanada <shigeru(dot)hanada(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimization for updating foreign tables in Postgres FDW
Date: 2016-03-07 12:53:23
Message-ID: 56DD79C3.3070004@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016/03/05 5:45, Robert Haas wrote:
> Some comments on the latest version. I haven't reviewed the
> postgres_fdw changes in detail here, so this is just about the core
> changes.

Thank you for taking the time to review the patch!

> I see that show_plan_tlist checks whether the operation is any of
> CMD_INSERT, CMD_UPDATE, or CMD_DELETE. But practically every place
> else where a similar test is needed instead tests whether the
> operation is *not* CMD_SELECT. I think this place should do it that
> way, too.
>
> + resultRelInfo = mtstate->resultRelInfo;
> for (i = 0; i < nplans; i++)
> {
> ExecAuxRowMark *aerm;
>
> + /*
> + * ignore subplan if the FDW pushes down the
> command to the remote
> + * server; the ModifyTable won't have anything
> to do except for
> + * evaluation of RETURNING expressions
> + */
> + if (resultRelInfo->ri_FdwPushdown)
> + {
> + resultRelInfo++;
> + continue;
> + }
> +
> subplan = mtstate->mt_plans[i]->plan;
> aerm = ExecBuildAuxRowMark(erm, subplan->targetlist);
> mtstate->mt_arowmarks[i] =
> lappend(mtstate->mt_arowmarks[i], aerm);
> + resultRelInfo++;
> }
>
>
> This kind of thing creates a hazard for future people maintaining this
> code. If somebody adds some code to this loop that needs to execute
> even when resultRelInfo->ri_FdwPushdown is true, they have to add two
> copies of it. It's much better to move the three lines of logic that
> execute only in the non-pushdown case inside of if
> (!resultRelInfo->ri_FdwPushdown).

Another option to avoid such a hazard would be to remove the two changes
from ExecInitModifyTable and create ExecAuxRowMarks and junk filters
even in the pushdown case. I made the changes because we won't use
ExecAuxRowMarks in that case since we don't need to do EvalPlanQual
rechecks and because we won't use junk filters in that case since we do
UPDATE/DELETE in the subplan. But the creating cost is enough small, so
simply removing the changes seems like a good idea.

> This issue crops up elsewhere as well. The changes to
> ExecModifyTable() have the same problem -- in that case, it might be
> wise to move the code that's going to have to be indented yet another
> level into a separate function. That code also says this:
>
> + /* No need to provide scan tuple to
> ExecProcessReturning. */
> + slot = ExecProcessReturning(resultRelInfo,
> NULL, planSlot);
>
> ...but, uh, why not? The comment says what the code does, but what it
> should do is explain why it does it.

As documented in IterateDMLPushdown in fdwhandler.sgml, the reason for
that is that in the pushdown case it's the IterateDMLPushdown's
responsiblity to get actually inserted/updated/deleted tuples and make
those tuples available to the ExecProcessReturning. I'll add comments.

> On a broader level, I'm not very happy with the naming this patch
> uses. Here's an example:
>
> + <para>
> + If an FDW supports optimizing foreign table updates, it still needs to
> + provide <function>PlanDMLPushdown</>, <function>BeginDMLPushdown</>,
> + <function>IterateDMLPushdown</> and <function>EndDMLPushdown</>
> + described below.
> + </para>
>
> "Optimizing foreign table updates" is both inaccurate (since it
> doesn't only optimize updates) and so vague as to be meaningless
> unless you already know what it means. The actual patch uses
> terminology like "fdwPushdowns" which is just as bad. We might push a
> lot of things to the foreign side -- sorts, joins, aggregates, limits
> -- and this is just one of them. Worse, "pushdown" is itself
> something of a term of art - will people who haven't been following
> all of the mammoth, multi-hundred-email threads on this topic know
> what that means? I think we need some better terminology here.
>
> The best thing that I can come up with offhand is "bulk modify". So
> we'd have PlanBulkModify, BeginBulkModify, IterateBulkModify,
> EndBulkModify, ExplainBulkModify. Other suggestions welcome. The
> ResultRelInfo flag could be ri_usesFDWBulkModify.

I'm not sure that "bulk modify" is best. Yeah, this would improve the
performance especially in the bulk-modification case, but would improve
the performance even in the case where an UPDATE/DELETE modifies just a
single row. Let me explain using an example. Without the patch, we
have the following plan for an UPDATE on a foreign table that updates a
single row:

postgres=# explain verbose update foo set a = a + 1 where a = 1;
QUERY PLAN
----------------------------------------------------------------------------------
Update on public.foo (cost=100.00..101.05 rows=1 width=14)
Remote SQL: UPDATE public.foo SET a = $2 WHERE ctid = $1
-> Foreign Scan on public.foo (cost=100.00..101.05 rows=1 width=14)
Output: (a + 1), b, ctid
Remote SQL: SELECT a, b, ctid FROM public.foo WHERE ((a = 1))
FOR UPDATE
(5 rows)

The plan requires two queries, SELECT and UPDATE, to do the update.
(Actually, the plan have additional overheads in creating a cursor for
the SELECT and establishing a prepared statement for the UPDATE.) But
with the patch, we have:

postgres=# explain verbose update foo set a = a + 1 where a = 1;
QUERY PLAN
---------------------------------------------------------------------------
Update on public.foo (cost=100.00..101.05 rows=1 width=14)
-> Foreign Update on public.foo (cost=100.00..101.05 rows=1 width=14)
Remote SQL: UPDATE public.foo SET a = (a + 1) WHERE ((a = 1))
(3 rows)

The optimized plan requires just a single UPDATE query to do that! So,
even in the single-row-modification case the patch could improve the
performance.

How about "Direct Modify"; PlanDirectModify, BeginDirectModify,
IterateDirectModify, EndDirectModify, ExplainDirectModify, and
ri_usesFDWDirectModify.

> The documentation
> could say something like this:
>
> Some inserts, updates, and deletes to foreign tables can be optimized
> by implementing an alternate set of interfaces. The ordinary
> interfaces for inserts, updates, and deletes fetch rows from the
> remote server and then modify those rows one at a time. In some
> cases, this row-by-row approach is necessary, but it can be
> inefficient. If it is possible for the foreign server to determine
> which rows should be modified without actually retrieving them, and if
> there are no local triggers which would affect the operation, then it
> is possible to arrange things so that the entire operation is
> performed on the remote server. The interfaces described below make
> this possible.

Will update as proposed.

> + Begin executing a foreign table update directly on the remote server.
>
> I think this should say "Prepare to execute a bulk modification
> directly on the remote server". It shouldn't actually begin the
> execution phase.
>
> + End the table update and release resources. It is normally not important
>
> And I think this one should say "Clean up following a bulk
> modification on the remote server". It's not actually ending the
> update; the iterate method already did that.

OK, will fix.

Best regards,
Etsuro Fujita

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message MauMau 2016-03-07 12:55:07 Re: How can we expand PostgreSQL ecosystem?
Previous Message Craig Ringer 2016-03-07 12:25:25 Re: How can we expand PostgreSQL ecosystem?