Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-09-06 17:54:13
Message-ID: CAAKRu_Z-3vRCTsgnL_iXy-Y5e9-GyFTPwYdqpN+=1g860zb8eA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 5, 2019 at 10:35 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> Seems like a good time for me to try to summarise what I think the
> main problems are here:
>
> 1. The match-bit storage problem already discussed. The tuples that
> each process receives while reading from SharedTupleStore are
> non-deterministic (like other parallel scans). To use a bitmap-based
> approach, I guess we'd need to invent some way to give the tuples a
> stable identifier within some kind of densely packed number space that
> we could use to address the bitmap, or take the IO hit and write all
> the tuples back. That might involve changing the way SharedTupleStore
> holds data.
>

I've dealt with this by adding a tuplenum counter to the
SharedTupleStore itself, which I atomically increment in
sts_puttuple(). In ExecParallelHashJoinPartitionOuter(), as each
worker writes tuples to the batch files, it calls sts_puttuple(),
which bumps the counter so that each tuple gets a unique number.
To persist this number, I added the tuplenum to the meta data
section of the MinimalTuple (along with the hashvalue -- there was a
comment saying this meta data could be used for other things in the
future, so it seemed like a good place to put it) and write that out
to the batch file.
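
A rough sketch of the shape of that change (the struct and field
names here are mine, not necessarily what the patch uses):

    /* in the SharedTuplestore control object (sharedtuplestore.c) */
    pg_atomic_uint64 tuplenum;      /* next tuple number to hand out */

    /* meta data stored in front of each tuple in the batch file */
    typedef struct OuterTupleMeta
    {
        uint32      hashvalue;      /* already stored per tuple today */
        uint64      tuplenum;       /* new: densely packed, stable id */
    } OuterTupleMeta;

    /* inside sts_puttuple(), before meta data and tuple are written */
    ((OuterTupleMeta *) meta_data)->tuplenum =
        pg_atomic_fetch_add_u64(&accessor->sts->tuplenum, 1);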

At the end of ExecParallelHashJoinPartitionOuter(), I create the
outer match status bitmap file, using the final tuplenum count to
determine how many bytes to write to it. Each worker gets a file
containing a bitmap with however many bytes are required to
represent the number of tuples in that batch.

Because one worker may beat the other(s) and build the whole batch
file for a batch before the others have a chance to contribute,
workers that missed out create their outer match status bitmap file
in ExecParallelHashJoinOuterGetTuple() instead, again sized from the
final tuplenum.
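
Concretely, that works out to one bit per outer tuple, rounded up to
whole bytes. A sketch, with illustrative file name and variables
(BufFileCreateShared() and BufFileWrite() are the existing buffile
APIs):

    uint64      ntuples = pg_atomic_read_u64(&sts->tuplenum);
    size_t      nbytes = (ntuples + 7) / 8;    /* one bit per tuple */
    char        name[MAXPGPATH];
    BufFile    *statusfile;
    char       *zeros = palloc0(nbytes);

    snprintf(name, sizeof(name), "o%d.matchbits.p%d", batchno, participant);
    statusfile = BufFileCreateShared(fileset, name);
    BufFileWrite(statusfile, zeros, nbytes);   /* all bits start at 0 */
    pfree(zeros);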

>
> 2. Tricky problems relating to barriers and flow control. First, let
> me explain why PHJ doesn't support full/right outer joins yet. At
> first I thought it was going to be easy, because, although the shared
> memory hash table is read-only after it has been built, it seems safe
> to weaken that only slightly and let the match flag be set by any
> process during probing: it's OK if two processes clobber each other's
> writes, as the only transition is a single bit going strictly from 0
> to 1, and there will certainly be a full memory barrier before anyone
> tries to read those match bits. Then during the scan for unmatched,
> you just have to somehow dole out hash table buckets or ranges of
> buckets to processes on a first-come-first-served basis. But.... then
> I crashed into the following problem:
>
> * You can't begin the scan for unmatched tuples until every process
> has finished probing (ie until you have the final set of match bits).
> * You can't wait for every process to finish probing, because any
> process that has emitted a tuple might never come back if there is
> another node that is also waiting for all processes (ie deadlock
> against another PHJ doing the same thing), and probing is a phase that
> emits tuples.
>
> Generally, it's not safe to emit tuples while you are attached to a
> Barrier, unless you're only going to detach from it, not wait at it,
> because emitting tuples lets the program counter escape your control.
> Generally, it's not safe to detach from a Barrier while accessing
> resources whose lifetime it controls, such as a hash table, because
> then it might go away underneath you.
>
> The PHJ plans that are supported currently adhere to that programming
> rule and so don't have a problem: after the Barrier reaches the
> probing phase, processes never wait for each other again so they're
> free to begin emitting tuples. They just detach when they're done
> probing, and the last to detach cleans up (frees the hash table etc).
> If there is more than one batch, they detach from one batch and attach
> to another when they're ready (each batch has its own Barrier), so we
> can consider the batches to be entirely independent.
>
> There is probably a way to make a scan-for-unmatched-inner phase work,
> possibly involving another Barrier or something like that, but I ran
> out of time trying to figure it out and wanted to ship a working PHJ
> for the more common plan types. I suppose PHLJ will face two variants
> of this problem: (1) you need to synchronise the loops (you can't dump
> the hash table in preparation for the next loop until all have
> finished probing for the current loop), and yet you've already emitted
> tuples, so you're not allowed to wait for other processes and they're
> not allowed to wait for you, and (2) you can't start the
> scan-for-unmatched-outer until all the probe loops belonging to one
> batch are done. The first problem is sort of analogous to a problem I
> faced with batches in the first place, which Robert and I found a
> solution to by processing the batches in parallel, and could perhaps
> be solved in the same way: run the loops in parallel (if that sounds
> crazy, recall that every worker has its own quota of work_mem and the
> data is entirely prepartitioned up front, which is why we are able to
> run the batches in parallel; in contrast, single-batch mode makes a
> hash table with a quota of nparticipants * work_mem). The second
> problem is sort of analogous to the existing scan-for-unmatched-inner
> problem that I haven't solved.
>
>
I "solved" these problem for now by having all workers except for one
detach from the outer batch file after finishing probing. The last
worker to arrive does not detach from the batch and instead iterates
through all of the workers' outer match status files per participant
shared mem SharedTuplestoreParticipant) and create a single unified
bitmap. All the other workers continue to wait at the barrier until
the sole remaining worker has finished with iterating through the
outer match status bitmap files.
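
The combining step itself is just a byte-wise OR of every
participant's file into one bitmap, roughly like this
(open_participant_match_file() is a hypothetical helper;
BufFileRead() and BufFileClose() are the real buffile calls):

    char       *combined = palloc0(nbytes);
    char        chunk[BLCKSZ];

    for (int i = 0; i < nparticipants; i++)
    {
        BufFile    *f = open_participant_match_file(batch, i);
        size_t      off = 0;
        size_t      nread;

        while ((nread = BufFileRead(f, chunk, sizeof(chunk))) > 0)
        {
            for (size_t j = 0; j < nread; j++)
                combined[off + j] |= chunk[j];  /* bits only go 0 -> 1 */
            off += nread;
        }
        BufFileClose(f);
    }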

Admittedly, I'm still fighting with this step a bit, but my intent
is to have all the backends wait until the lone remaining worker has
created the unified bitmap; then that worker, which is still attached
to the outer batch, will scan the outer batch file against the
unified outer match status bitmap and emit unmatched tuples.

I thought that the other workers could move on and stop waiting at
the barrier once the lone remaining worker has scanned their outer
match status files. All the probe loops would be done by then, and
the worker emitting tuples references only the outer batch file and
the combined bitmap, never the inner side hashtable.
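
So the emit pass only needs the sts scan of the outer batch plus a
bit test against the combined bitmap, roughly (sts_parallel_scan_next()
and ExecStoreAllNullTuple() are existing APIs; the meta data layout
follows the earlier sketch):

    MinimalTuple    tuple;
    OuterTupleMeta  meta;       /* hashvalue + tuplenum, as above */

    sts_begin_parallel_scan(outer_acc);
    while ((tuple = sts_parallel_scan_next(outer_acc, &meta)) != NULL)
    {
        uint64      n = meta.tuplenum;

        if ((combined[n / 8] & (1 << (n % 8))) == 0)
        {
            /* no inner tuple matched this outer tuple: emit it with
             * the inner side nulled out */
            ExecForceStoreMinimalTuple(tuple, econtext->ecxt_outertuple,
                                       false);
            ExecStoreAllNullTuple(econtext->ecxt_innertuple);
            /* ... project and return the joined tuple ... */
        }
    }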

--
Melanie Plageman
