Re: Parallel Full Hash Join

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Full Hash Join
Date: 2023-03-30 19:23:15
Message-ID: CAAKRu_b+DoqUwfOJFNmp48gYsKkMJWw+damJ188n7wZdbGFw2A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 27, 2023 at 7:04 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> I found another problem. I realised that ... FULL JOIN ... LIMIT n
> might be able to give wrong answers with unlucky scheduling.
> Unfortunately I have been unable to reproduce the phenomenon I am
> imagining yet but I can't think of any mechanism that prevents the
> following sequence of events:
>
> P0 probes, pulls n tuples from the outer relation and then the
> executor starts to shut down (see commit 3452dc52 which pushed down
> LIMIT), but just then P1 attaches, right before P0 does. P1
> continues, and finds < n outer tuples while probing and then runs out
> so it enters the unmatched scan phase, and starts emitting bogusly
> unmatched tuples. Some outer tuples we needed to get the complete set
> of match bits and thus the right answer were buffered inside P0's
> subplan and abandoned.
>
> I've attached a simple fixup for this problem. Short version: if
> you're abandoning your PHJ_BATCH_PROBE phase without reaching the end,
> you must be shutting down, so the executor must think it's OK to
> abandon tuples this process has buffered, so it must also be OK to
> throw all unmatched tuples out the window too, as if this process was
> about to emit them. Right?

I understand the scenario you are thinking of; however, I question how
those incorrectly formed tuples would ever be returned by the query. The
hashjoin would only start to shut down once enough tuples had been
emitted to satisfy the limit, at which point those tuples buffered in
P0 may be emitted by this worker but wouldn't be included in the query
result, no?

I suppose that even if what I said is true, we do not want the hashjoin
node to ever produce incorrect tuples, in which case your fix seems
correct to me.
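
For my own notes, the query shape we're worried about is roughly the
following (table names are invented, and the GUCs are only there to coax
the planner toward a parallel hash join; this isn't a reproducer, since
the bad interleaving depends on scheduling):

    set enable_mergejoin = off;
    set enable_nestloop = off;
    set max_parallel_workers_per_gather = 2;

    select *
      from big_outer o
      full join big_inner i using (id)
     limit 10;

Since commit 3452dc52 pushed the LIMIT down, a worker can stop probing
early, which is what opens the window you describe.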

> With all the long and abstract discussion of hard to explain problems
> in this thread and related threads, I thought I should take a step
> back and figure out a way to demonstrate what this thing really does
> visually. I wanted to show that this is a very useful feature that
> unlocks previously unobtainable parallelism, and to show the
> compromise we've had to make so far in an intuitive way. With some
> extra instrumentation hacked up locally, I produced the attached
> "progress" graphs for a very simple query: SELECT COUNT(*) FROM r FULL
> JOIN s USING (i). Imagine a time axis along the bottom, but I didn't
> bother to add numbers because I'm just trying to convey the 'shape' of
> execution with relative times and synchronisation points.
>
> Figures 1-3 show that phases 'h' (hash) and 'p' (probe) are
> parallelised and finish sooner as we add more processes to help out,
> but 's' (= the unmatched inner tuple scan) is not. Note that if all
> inner tuples are matched, 's' becomes extremely small and the
> parallelism is almost as fair as a plain old inner join, but here I've
> maximised it: all inner tuples were unmatched, because the two
> relations have no matches at all. Even if we achieve perfect linear
> scalability for the other phases, the speedup will be governed by
> https://en.wikipedia.org/wiki/Amdahl%27s_law and the only thing that
> can mitigate that is if there is more useful work those early-quitting
> processes could do somewhere else in your query plan.
>
> Figure 4 shows that it gets a lot fairer in a multi-batch join,
> because there is usually useful work to do on other batches of the
> same join. Notice how processes initially work on loading, probing
> and scanning different batches to reduce contention, but they are
> capable of ganging up to load and/or probe the same batch if there is
> nothing else left to do (for example P2 and P3 both work on p5 near
> the end). For now, they can't do that for the s phases. (BTW, the
> little gaps before loading are the allocation phase, which I didn't
> bother to plot because a label wouldn't fit on them; this
> visualisation technique is a WIP.)
>
> With the "opportunistic" change we are discussing for later work,
> figure 4 would become completely fair (P0 and P2 would be able to join
> in and help out with s6 and s7), but single-batch figures 1-3 would
> not (that would require a different executor design AFAICT, or a
> eureka insight we haven't had yet; see long-winded discussion).

Cool diagrams!
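
On the Amdahl's law point, a rough worked example (numbers invented just
to illustrate the bound): if the serial unmatched scan ('s') is a
fraction f of the single-process runtime, then the best case with N
processes is

    speedup(N) = 1 / (f + (1 - f) / N)

so with f = 0.2, four processes top out at 1 / (0.2 + 0.8/4) = 2.5x, and
even infinitely many only reach 1 / 0.2 = 5x, which matches the shape of
figures 1-3.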

> The last things I'm thinking about now: Are the planner changes
> right?

I think the current changes are correct. I wonder if we have to change
anything in initial/final_cost_hashjoin to account for the fact that,
for a single-batch parallel full/right hash join, part of the execution
is serial. And, if so, do we need to consider the estimated number of
unmatched tuples to be emitted?
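
If we did decide to account for it, I imagine (hand-waving, not a patch)
it would amount to charging the unmatched scan to a single participant
rather than dividing it by the parallel divisor, something like

    unmatched_scan_cost ~= cpu_tuple_cost * est_unmatched_inner_rows

where est_unmatched_inner_rows is a name I'm making up for whatever we
could derive from the existing join selectivity estimates.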

> Are the tests enough?

So, the tests currently in the patch set cover the unmatched tuple scan
phase for a single-batch parallel full hash join. I've attached the
dumbest possible addition to that, which adds a multi-batch parallel
full hash join case. I did not do any checking to ensure that I picked
the case that would add the least execution time to the test, etc.
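
Just to sketch the general shape of such a test (this is not the
attached patch; the table name and settings are placeholders): shrink
work_mem to force multiple batches, require a parallel-aware hash join,
and check the full join's row count.

    begin;
    set local max_parallel_workers_per_gather = 2;
    set local work_mem = '128kB';
    set local enable_parallel_hash = on;
    explain (costs off)
      select count(*) from simple r full outer join simple s using (id);
    select count(*) from simple r full outer join simple s using (id);
    rollback;

A real test would also want to verify that the join actually went
multi-batch, the way the existing hash join tests do.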

Of course, this does leave the skip_unmatched code you added uncovered,
but I think if we had the testing infrastructure to test that, we would
be on a beach somewhere reading a book instead of beating our heads
against the wall trying to determine if there are any edge cases we are
missing in adding this feature.

- Melanie

Attachment Content-Type Size
Add-multi-batch-parallel-full-hash-join-test.patch text/x-patch 1.5 KB
