Re: Parallel Append implementation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-11-09 18:14:38
Message-ID: CA+TgmoZ3UGPn-1=kkFn4AVD-Zi=6i_T1H=mrsTvHzcgY7-3x3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> No, because the Append node is *NOT* getting copied into shared
> memory. I have pushed a comment update to the existing functions; you
> can use the same comment for this patch.

I spent the last several days working on this patch, which had a
number of problems both cosmetic and functional. I think the attached
is in better shape now, but it could certainly use some more review
and testing since I only just finished modifying it, and I modified it
pretty heavily. Changes:

- I fixed the "morerows" entries in the documentation. If you had
built the documentation the way you had it and loaded up in a web
browser, you would have seen that the way you had it was not correct.

- I moved T_AppendState to a different position in the switch inside
ExecParallelReInitializeDSM, so as to keep that switch in the same
order as all of the other switch statements in that file.

- I rewrote the comment for pa_finished. It previously began with
"workers currently executing the subplan", which is not an accurate
description. I suspect this was a holdover from a previous version of
the patch in which this was an array of integers rather than an array
of type bool. I also fixed the comment in ExecAppendEstimate and
added, removed, or rewrote various other comments as well.

- I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is
more clear and allows for the possibility that this sentinel value
might someday be used for non-parallel-aware Append plans.

- I largely rewrote the code for picking the next subplan. A
superficial problem with the way that you had it is that you had
renamed exec_append_initialize_next to exec_append_seq_next but not
updated the header comment to match. Also, the logic was spread out
all over the file. There are three cases: not parallel aware, leader,
worker. You had the code for the first case at the top of the file
and the other two cases at the bottom of the file and used multiple
"if" statements to pick the right one in each case. I replaced all
that with a function pointer stored in the AppendState, moved the code
so it's all together, and rewrote it in a way that I find easier to
understand. I also changed the naming convention.

- I renamed pappend_len to pstate_len and ParallelAppendDescData to
ParallelAppendState. I think the use of the word "descriptor" is a
carryover from the concept of a scan descriptor. There's nothing
really wrong with inventing the concept of an "append descriptor", but
it seems more clear to just refer to shared state.

- I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
instead, local state should be reset in ExecReScanAppend. I installed
what I believe to be the correct logic in that function instead.

- I fixed list_qsort() so that it copies the type of the old list into
the new list. Otherwise, sorting a list of type T_IntList or
T_OidList would turn it into just plain T_List, which is wrong.

- I removed get_append_num_workers and integrated the logic into the
callers. This function was coded quite strangely: it assigned the
return value of fls() to a double and then eventually rounded the
result back to an integer. But fls() returns an integer, so this
doesn't make much sense. On a related note, I made it use fls(# of
subpaths) instead of fls(# of subpaths)+1. Adding 1 doesn't make
sense to me here because it leads to a decision to use 2 workers for a
single, non-partial subpath. I suspect both of these mistakes stem
from thinking that fls() returns the base-2 logarithm, but in fact it
doesn't, quite: log2(1) = 0.0 but fls(1) = 1.

- In the process of making the changes described in the previous
point, I added a couple of assertions, one of which promptly failed.
It turns out the reason is that your patch didn't update
accumulate_append_subpaths(), which can result in flattening
non-partial paths from a Parallel Append into a parent Append's list
of partial paths, which is bad. The easiest way to fix that would be
to just teach accumulate_append_subpaths() not to flatten a Parallel
Append into a parent Append or MergeAppend node, but it seemed to me
that there was a fair amount of duplication between
accumulate_partialappend_subpath() and accumulate_append_subpaths, so
what I did instead is folded all of the necessarily logic directly
into accumulate_append_subpaths(). This approach also avoids some
assumptions that your code made, such as the assumption that we will
never have a partial MergeAppend path.

- I changed create_append_path() so that it uses the parallel_aware
argument as the only determinant of whether the output path is flagged
as parallel-aware. Your version also considered whether
parallel_workers > 0, but I think that's not a good idea; the caller
should pass the correct value for parallel_aware rather than relying
on this function to fix it. Possibly you indirectly encountered the
problem mentioned in the previous point and worked around it like
this, or maybe there was some other reason, but it doesn't seem to be
necessary.

- I changed things around to enforce the rule that all partial paths
added to an appendrel must use the same row count estimate. (This is
not a new coding rule, but this patch provides a new way to violate
it.) I did that by forcing the row-count for any parallel append
mixing partial and non-partial paths to use the same row count as the
row already added. I also changed the way the row count is calculated
in the case where the only parallel append path mixes partial and
non-partial plans; I think this way is more consistent with what we've
done elsewhere. This amounts to the assumption that we're trying to
estimate the average number of rows per worker rather than the largest
possible number; I'm not sure what the best thing to do here is in
theory, but one advantage of this approach is that I think it will
produce answers closer to the value we get for an all-partial-paths
append. That's good, because we don't want the row-count estimate to
change precipitously based on whether an all-partial-paths append is
possible.

- I fixed some whitespace problems by running pgindent on various
files and manually breaking some long lines.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-append-v19.patch application/octet-stream 57.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-11-09 18:16:04 Re: Simplify ACL handling for large objects and removal of superuser() checks
Previous Message Robert Haas 2017-11-09 18:00:55 Re: pageinspect option to forgo buffer locking?