Re: making update/delete of inheritance trees scale better

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: making update/delete of inheritance trees scale better
Date: 2021-03-26 15:21:44
Message-ID: CA+HiwqGb9Y2zgy7bSvp0M7y-_f94VwXLrrSL_z=2L_PMe=k35Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 24, 2021 at 1:46 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Mar 3, 2021 at 9:39 AM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > Just noticed that a test added by the recent 927f453a941 fails due to
> > 0002. We no longer allow moving a row into a postgres_fdw partition
> > if it is among the UPDATE's result relations, whereas previously we
> > would if the UPDATE on that partition is already finished.
> >
> > To fix, I've adjusted the test case. Attached updated version.
>
> I spent some time studying this patch this morning.

Thanks a lot for your time on this.

> As far as I can
> see, 0001 is a relatively faithful implementation of the design Tom
> proposed back in early 2019. I think it would be nice to either get
> this committed or else decide that we don't want it and what we're
> going to try to do instead, because we can't make UPDATEs and DELETEs
> stop sucking with partitioned tables until we settle on some solution
> to the problem that is inheritance_planner(), and that strikes me as
> an *extremely* important problem. Lots of people are trying to use
> partitioning in PostgreSQL, and they don't like finding out that, in
> many cases, it makes things slower rather than faster. Neither this
> nor any other patch is going to solve that problem in general, because
> there are *lots* of things that haven't been well-optimized for
> partitioning yet. But, this is a pretty important case that we should
> really try to do something about.
>
> So, that said, here are some random comments:
>
> - I think it would be interesting to repeat your earlier benchmarks
> using -Mprepared. One question I have is whether we're saving overhead
> here during planning at the price of execution-time overhead, or
> whether we're saving during both planning and execution.

Please see the bottom of this reply.

> - Until I went back and found your earlier remarks on this thread, I
> was confused as to why you were replacing a JunkFilter with a
> ProjectionInfo. I think it would be good to try to add some more
> explicit comments about that design choice, perhaps as header comments
> for ExecGetUpdateNewTuple, or maybe there's a better place.

I think the comments around ri_projectNew, which holds the
ProjectionInfo node, explain this to some degree, especially the
comment in ExecInitModifyTable() that sets it. I don't particularly
see a need to go into detail about why a JunkFilter is not suitable
for the task if we're no longer using one at all in nodeModifyTable.c.

> I'm still
> not sure why we need to do the same thing for the insert case, which
> seems to just be about removing junk columns.

I think I was hesitant to have both ri_junkFilter and ri_projectNew,
catering to inserts and updates/deletes respectively.

> At least in the non-JIT
> case, it seems to me that ExecJunkFilter() should be cheaper than
> ExecProject(). Is it different enough to matter? Does the fact that we
> can JIT the ExecProject() work make it actually faster? These are
> things I don't know.

ExecJunkFilter() does indeed look cheaper at first glance for simple
junk filtering, but as Tom also found out, there's no test case
involving INSERT with which to make an actual performance comparison.
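
For what it's worth, here is a minimal sketch of the two paths being
compared (illustrative only; it assumes the standard executor entry
points ExecFilterJunk() and ExecProject(), plus the ri_junkFilter and
ri_projectNew fields discussed above):

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

/*
 * JunkFilter path: copies the non-junk datums of the plan tuple into a
 * slot of the target relation's row type.  No expression evaluation
 * happens, so there is nothing for JIT to compile.
 */
static TupleTableSlot *
new_tuple_via_junkfilter(ResultRelInfo *rri, TupleTableSlot *planSlot)
{
    return ExecFilterJunk(rri->ri_junkFilter, planSlot);
}

/*
 * ProjectionInfo path: evaluates a targetlist over the plan tuple.
 * There is more machinery per call, but the expression program can be
 * JIT-compiled, which may win back the difference.
 */
static TupleTableSlot *
new_tuple_via_projection(ResultRelInfo *rri, TupleTableSlot *planSlot)
{
    ExprContext *econtext = rri->ri_projectNew->pi_exprContext;

    econtext->ecxt_outertuple = planSlot;
    return ExecProject(rri->ri_projectNew);
}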

> - There's a comment which you didn't write but just moved which I find
> to be quite confusing. It says "For UPDATE/DELETE, find the
> appropriate junk attr now. Typically, this will be a 'ctid' or
> 'wholerow' attribute, but in the case of a foreign data wrapper it
> might be a set of junk attributes sufficient to identify the remote
> row." But, the block of code which follows caters only to the 'ctid'
> and 'wholerow' cases, not anything else. Perhaps that's explained by
> the comment a bit further down which says "When there is a row-level
> trigger, there should be a wholerow attribute," but if the point is
> that this code is only reached when there's a row-level trigger,
> that's far from obvious. It *looks* like something we'd reach for ANY
> insert or UPDATE case. Maybe it's not your job to do anything about
> this confusion, but I thought it was worth highlighting.

I remember being confused by that note about the junk attributes
required by FDWs for their result relations when I first saw it, but I
eventually figured out that it's referring to the junk-attribute
information that FDWs track in their *private* data structures. For
example, postgres_fdw uses PgFdwModifyState.ctidAttno to record the
index of the "ctid" TLE in the source plan's targetlist. It is used,
for example, by postgresExecForeignUpdate() to extract the ctid from
the plan tuple passed to it and pass that value as a parameter of the
remote query: update remote_tab set ... where ctid = $1.

I've updated the comment to make that a bit clearer.
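
To illustrate, here is a sketch along the lines of what postgres_fdw
does (the state struct and function names below are made up, but
ExecFindJunkAttributeInTlist() and ExecGetJunkAttribute() are the
actual executor helpers involved):

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/plannodes.h"
#include "storage/itemptr.h"

/* Made-up private state, modeled on postgres_fdw's PgFdwModifyState */
typedef struct MyFdwModifyState
{
    AttrNumber  ctidAttno;  /* attno of "ctid" in the subplan tlist */
} MyFdwModifyState;

/*
 * At BeginForeignModify time: remember where "ctid" sits in the source
 * subplan's targetlist.
 */
static void
my_remember_ctid_attno(MyFdwModifyState *fmstate, Plan *subplan)
{
    fmstate->ctidAttno = ExecFindJunkAttributeInTlist(subplan->targetlist,
                                                      "ctid");
    if (!AttributeNumberIsValid(fmstate->ctidAttno))
        elog(ERROR, "could not find junk ctid column");
}

/*
 * At ExecForeignUpdate time: pull the ctid out of the plan tuple; its
 * value becomes the $1 in "update remote_tab set ... where ctid = $1".
 */
static ItemPointer
my_extract_ctid(MyFdwModifyState *fmstate, TupleTableSlot *planSlot)
{
    bool        isNull;
    Datum       datum = ExecGetJunkAttribute(planSlot, fmstate->ctidAttno,
                                             &isNull);

    if (isNull)
        elog(ERROR, "ctid is NULL");
    return (ItemPointer) DatumGetPointer(datum);
}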

> - The comment for filter_junk_tlist_entries(), needless to say, is of
> the highest quality,

Sorry, it was a copy-paste job.

> but would it make any sense to think about having
> an option for expand_tlist() to omit the junk entries itself, to avoid
> extra work? I'm unclear whether we want that behavior in all UPDATE
> cases or only some of them, because preprocess_targetlist() has a call
> to expand_tlist() to set parse->onConflict->onConflictSet that does
> not call filter_junk_tlist_entries() on the result.

I added an exclude_junk parameter to expand_targetlist() and passed
false for it at all call sites except the one in make_update_tlist(),
including where it's called on parse->onConflict->onConflictSet.

That said, make_update_tlist() and related code may have been
superseded by Tom's WIP patch.

> Does this patch
> need to make any changes to the handling of ON CONFLICT .. UPDATE? It
> looks to me like right now it doesn't, but I don't know whether that's
> an oversight or intentional.

I intentionally didn't touch the ON CONFLICT UPDATE case, mainly
because INSERTs don't have an inheritance_planner() problem. We may
want to revisit that in the future if we decide to revise the
ExecUpdate() API to not pass the fully-reconstructed new tuple,
something this patch doesn't attempt.

> - The output changes in the main regression test suite are pretty easy
> to understand: we're just seeing columns that no longer need to get
> fed through the execution get dropped. The changes in the postgres_fdw
> results are harder to understand. In general, it appears that what's
> happening is that we're no longer outputting the non-updated columns
> individually -- which makes sense -- but now we're outputting a
> whole-row var that wasn't there before, e.g.:
>
> - Output: foreign_tbl.a, (foreign_tbl.b + 15), foreign_tbl.ctid
> + Output: (foreign_tbl.b + 15), foreign_tbl.ctid, foreign_tbl.*
>
> Since this is postgres_fdw, we can re-fetch the row using CTID, so
> it's not clear to me why we need foreign_tbl.* when we didn't before.
> Maybe the comments need improvement.

The ExecForeignUpdate() FDW API expects to be passed a fully-formed
new tuple, even though it will typically access only the changed
columns of that tuple to pass to the remote update query. There is a
comment in rewriteTargetListUD() explaining this, which I have updated
to read as follows:

/*
* ExecUpdate() needs to pass a full new tuple to be assigned to the
* result relation to ExecForeignUpdate(), although the plan will have
* produced values for only the changed columns. Here we ask the FDW
* to fetch wholerow to serve as the side channel for getting the
* values of the unchanged columns when constructing the full tuple to
* be passed to ExecForeignUpdate(). Actually, we only really need
* this for UPDATEs that are not pushed to the remote side, but whether
* or not the pushdown will occur is not clear when this function is
* called, so we ask for wholerow anyway.
*
* We will also need the "old" tuple if there are any row triggers.
*/

> - Specifically, I think the comments in preptlist.c need some work.
> You've edited the top-of-file comment, but I don't think it's really
> accurate. The first sentence says "For INSERT and UPDATE, the
> targetlist must contain an entry for each attribute of the target
> relation in the correct order," but I don't think that's really true
> any more. It's certainly not what you see in the EXPLAIN output. The
> paragraph goes on to explain that UPDATE has a second target list, but
> (a) that seems to contradict the first sentence and (b) that second
> target list isn't what you see when you run EXPLAIN. Also, there's no
> mention of what happens for FDWs here, but it's evidently different,
> as per the previous review comment.

It seems Tom has other things in mind for what I've implemented as
update_tlist, so I will leave this alone.

> - The comments atop fix_join_expr() should be updated. Maybe you can
> just adjust the wording for case #2.

Apparently the changes in setrefs.c are being thrown out as well in
Tom's patch, so likewise I will leave this alone.

Attached is an updated version of the patch. I forgot to mention in my
recent posts on this thread one thing about 0001 that I had mentioned
upthread back in June: it currently fails a test in postgres_fdw's
suite due to a bug in cross-partition updates that I decided at the
time to pursue in another thread:
https://www.postgresql.org/message-id/CA%2BHiwqE_UK1jTSNrjb8mpTdivzd3dum6mK--xqKq0Y9VmfwWQA%40mail.gmail.com

That bug is exposed by some of the changes that 0001 makes. However,
it does not matter once 0002 is applied, because the current way of
having one plan per result relation is a precondition for the bug to
manifest. So, if we are to apply only 0001 first, I'm afraid we would
have to take care of that bug beforehand.

Finally, here are the detailed results of the benchmarks I redid to
check the performance implications of doing UPDATEs the new way,
comparing master and 0001.

I repeated two custom pgbench tests against UPDATE target tables
containing 10, 20, 40, and 80 columns. The two tests are as follows:

nojoin:

\set a random(1, 1000000)
update test_table t set b = :a where a = :a;

join:

\set a random(1, 1000000)
update test_table t set b = foo.b from foo where t.a = foo.a and foo.b = :a;

foo has just two integer columns, a and b, with an index on b.
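
In case it helps to visualize the setup, the tables look roughly like
this (a sketch only; the exact DDL, including the filler columns and
the data loading, is generated by the attached test_update.sh):

-- test_table: a, b, plus enough integer filler columns to reach the
-- 10/20/40/80-column counts; an index on a serves the where clause
create table test_table (a int, b int, c3 int /* ... more filler */);
create index on test_table (a);
insert into test_table (a, b)
    select i, i from generate_series(1, 1000000) i;

-- foo: just 2 integer columns, with an index on b
create table foo (a int, b int);
create index on foo (b);
insert into foo select i, i from generate_series(1, 1000000) i;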

This time I checked using both -Msimple and -Mprepared, whereas last
time I had only checked the former.

I'd summarize the results I see as follows:

In -Msimple mode, patched wins by a tiny margin in both the nojoin and
join cases at 10 and 20 columns, and by a slightly larger margin at 40
and 80 columns, with the join case showing a bigger margin than nojoin.

In -Mprepared mode, where the numbers are a bit noisy, I can only tell
clearly that patched wins by a very wide margin in the join case at 40
and 80 columns; there is no clear winner in the other cases.

To answer Robert's questions in this regard:

> One question I have is whether we're saving overhead
> here during planning at the price of execution-time overhead, or
> whether we're saving during both planning and execution.

The smaller targetlists produced under the patch at least help the
patched version end up on the better side of the tps comparison.
Maybe that aspect helps reduce both planning and execution time. As
for whether the results show a penalty from the fact that we now fetch
the tuple one more time to construct the new tuple, I don't see that
to be the case.

Raw tps figures (each case repeated 3 times) follow. I'm also
attaching a (hopefully self-contained) shell script (test_update.sh)
that you can run to reproduce the numbers for the various cases.

10 columns

nojoin simple master
tps = 12278.749205 (without initial connection time)
tps = 11537.051718 (without initial connection time)
tps = 12312.717990 (without initial connection time)
nojoin simple patched
tps = 12160.125784 (without initial connection time)
tps = 12170.271905 (without initial connection time)
tps = 12212.037774 (without initial connection time)

nojoin prepared master
tps = 12228.149183 (without initial connection time)
tps = 12509.135100 (without initial connection time)
tps = 11698.161145 (without initial connection time)
nojoin prepared patched
tps = 13033.005860 (without initial connection time)
tps = 14690.203013 (without initial connection time)
tps = 15083.096511 (without initial connection time)

join simple master
tps = 9112.059568 (without initial connection time)
tps = 10730.739559 (without initial connection time)
tps = 10663.677821 (without initial connection time)
join simple patched
tps = 10980.139631 (without initial connection time)
tps = 10887.743691 (without initial connection time)
tps = 10929.663379 (without initial connection time)

join prepared master
tps = 21333.421825 (without initial connection time)
tps = 23895.538826 (without initial connection time)
tps = 24761.384786 (without initial connection time)
join prepared patched
tps = 25665.062858 (without initial connection time)
tps = 25037.391119 (without initial connection time)
tps = 25421.839842 (without initial connection time)

20 columns

nojoin simple master
tps = 11215.161620 (without initial connection time)
tps = 11306.536537 (without initial connection time)
tps = 11310.776393 (without initial connection time)
nojoin simple patched
tps = 11791.107767 (without initial connection time)
tps = 11757.933141 (without initial connection time)
tps = 11743.983647 (without initial connection time)

nojoin prepared master
tps = 17144.510719 (without initial connection time)
tps = 14032.133587 (without initial connection time)
tps = 15678.801224 (without initial connection time)
nojoin prepared patched
tps = 16603.131255 (without initial connection time)
tps = 14703.564675 (without initial connection time)
tps = 13652.827905 (without initial connection time)

join simple master
tps = 9637.904229 (without initial connection time)
tps = 9869.163480 (without initial connection time)
tps = 9865.673335 (without initial connection time)
join simple patched
tps = 10779.705826 (without initial connection time)
tps = 10790.961520 (without initial connection time)
tps = 10917.759963 (without initial connection time)

join prepared master
tps = 23030.120609 (without initial connection time)
tps = 22347.620338 (without initial connection time)
tps = 24227.376933 (without initial connection time)
join prepared patched
tps = 22303.689184 (without initial connection time)
tps = 24507.395745 (without initial connection time)
tps = 25219.535413 (without initial connection time)

40 columns

nojoin simple master
tps = 10348.352638 (without initial connection time)
tps = 9978.449528 (without initial connection time)
tps = 10024.132430 (without initial connection time)
nojoin simple patched
tps = 10169.485989 (without initial connection time)
tps = 10239.297780 (without initial connection time)
tps = 10643.076675 (without initial connection time)

nojoin prepared master
tps = 13606.361325 (without initial connection time)
tps = 15815.149553 (without initial connection time)
tps = 15940.675165 (without initial connection time)
nojoin prepared patched
tps = 13889.450942 (without initial connection time)
tps = 13406.879350 (without initial connection time)
tps = 15640.326344 (without initial connection time)

join simple master
tps = 9235.503480 (without initial connection time)
tps = 9244.756832 (without initial connection time)
tps = 8785.542317 (without initial connection time)
join simple patched
tps = 10106.285796 (without initial connection time)
tps = 10375.248536 (without initial connection time)
tps = 10357.087162 (without initial connection time)

join prepared master
tps = 18795.665779 (without initial connection time)
tps = 17650.815736 (without initial connection time)
tps = 20903.206602 (without initial connection time)
join prepared patched
tps = 24706.505207 (without initial connection time)
tps = 22867.751793 (without initial connection time)
tps = 23589.244380 (without initial connection time)

80 columns

nojoin simple master
tps = 8281.679334 (without initial connection time)
tps = 7517.657106 (without initial connection time)
tps = 8509.366647 (without initial connection time)
nojoin simple patched
tps = 9200.437258 (without initial connection time)
tps = 9349.939671 (without initial connection time)
tps = 9128.197101 (without initial connection time)

nojoin prepared master
tps = 12975.410783 (without initial connection time)
tps = 13486.858443 (without initial connection time)
tps = 10994.355244 (without initial connection time)
nojoin prepared patched
tps = 14266.725696 (without initial connection time)
tps = 15250.258418 (without initial connection time)
tps = 13356.236075 (without initial connection time)

join simple master
tps = 7678.440018 (without initial connection time)
tps = 7699.796166 (without initial connection time)
tps = 7880.407359 (without initial connection time)
join simple patched
tps = 9552.413096 (without initial connection time)
tps = 9469.579290 (without initial connection time)
tps = 9584.026033 (without initial connection time)

join prepared master
tps = 18390.262404 (without initial connection time)
tps = 18754.121500 (without initial connection time)
tps = 20355.875827 (without initial connection time)
join prepared patched
tps = 24041.648927 (without initial connection time)
tps = 22510.192030 (without initial connection time)
tps = 21825.870402 (without initial connection time)

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v11-0002-Revise-how-inherited-update-delete-are-handled.patch application/octet-stream 204.7 KB
v11-0001-Overhaul-how-updates-compute-new-tuple.patch application/octet-stream 79.1 KB
test_update.sh text/x-sh 1.3 KB
