Re: pgsql: Add parallel-aware hash joins.

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-committers <pgsql-committers(at)postgresql(dot)org>
Subject: Re: pgsql: Add parallel-aware hash joins.
Date: 2018-01-03 01:14:30
Message-ID: CAEepm=2WFQrwDD4rEVKjdPBQ_di_0n=xYST1MDRsucZ=JgDPxg@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Sun, Dec 31, 2017 at 1:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Right. That's apparently unrelated and is the last build-farm issue
>> on my list (so far). I had noticed that certain BF animals are prone
>> to that particular failure, and they mostly have architectures that I
>> don't have so a few things are probably just differently sized. At
>> first I thought I'd tweak the tests so that the parameters were always
>> stable, and I got as far as installing Debian on qemu-system-ppc (it
>> took a looong time to compile PostgreSQL), but that seems a bit cheap
>> and flimsy... better to fix the size estimation error.
>
> "Size estimation error"? Why do you think it's that? We have exactly
> the same plan in both cases.

I mean that ExecChooseHashTableSize() estimates the hash table size like this:

inner_rel_bytes = ntuples * tupsize;

... but then at execution time, in the Parallel Hash case, we do memory
accounting not in tuples but in chunks.  The various participants pack
tuples into 32KB chunks, and they trigger an increase in the number of
batches when the total size of all chunks exceeds the memory budget.  In
this case they do so unexpectedly, because of extra overhead at execution
time that the planner didn't account for.  Since we happened to be close
to the threshold between choosing 8 batches and 16 batches, we can get it
wrong and have to increase nbatch at execution time.
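
To make that concrete, here is a minimal standalone sketch, not the
actual PostgreSQL code: the 32KB chunk size matches the above, but the
tuple count, tuple size, worker count and budget are made-up numbers
chosen so that the margin is smaller than the chunk rounding overhead.

    /*
     * Standalone sketch (not PostgreSQL code): compare the planner-style
     * estimate, which counts raw tuple bytes, with execution-style
     * accounting, which counts whole 32KB chunks per participant.  All
     * numbers are illustrative.
     */
    #include <stdio.h>
    #include <stddef.h>

    #define CHUNK_SIZE ((size_t) 32 * 1024)

    int
    main(void)
    {
        size_t      ntuples = 100000;   /* hypothetical inner relation */
        size_t      tupsize = 64;       /* hypothetical tuple size */
        int         nparticipants = 4;  /* hypothetical worker count */
        size_t      budget = 6400000;   /* hypothetical memory budget */

        /* Planner-style estimate: pure tuple payload. */
        size_t      inner_rel_bytes = ntuples * tupsize;

        /*
         * Execution-style accounting: each participant fills whole
         * chunks, so its share is rounded up to a multiple of CHUNK_SIZE.
         */
        size_t      tuples_per_chunk = CHUNK_SIZE / tupsize;
        size_t      tuples_per_worker = ntuples / nparticipants;
        size_t      chunks_per_worker =
            (tuples_per_worker + tuples_per_chunk - 1) / tuples_per_chunk;
        size_t      chunked_bytes =
            nparticipants * chunks_per_worker * CHUNK_SIZE;

        printf("planner estimate: %zu bytes (%s budget)\n", inner_rel_bytes,
               inner_rel_bytes <= budget ? "within" : "over");
        printf("chunked total:    %zu bytes (%s budget)\n", chunked_bytes,
               chunked_bytes <= budget ? "within" : "over");
        return 0;
    }

With those made-up numbers the tuple-based total just fits the budget
while the chunk-rounded total crosses it, which is the kind of
near-threshold case where nbatch gets doubled at execution time.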

Non-parallel Hash also has such fragmentation: there are headers plus
extra space at the end of each chunk, especially at the end of the final
chunk.  But it doesn't matter there, because the executor doesn't count
that overhead either.  For Parallel Hash I do count the overhead, because
doing all the accounting in 32KB chunks reduces IPC.
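
To illustrate why, here is a sketch of chunk-granularity accounting
(again not the real code; the struct and function names are invented):
the shared total is only bumped once per 32KB chunk, so tuples within a
chunk cost no shared-memory traffic at all.

    #include <stdatomic.h>
    #include <stddef.h>

    #define CHUNK_SIZE ((size_t) 32 * 1024)

    typedef struct SharedAccounting
    {
        _Atomic size_t total_chunk_bytes;   /* shared by all participants */
    } SharedAccounting;

    typedef struct LocalChunkState
    {
        size_t      used;           /* bytes used in my current chunk */
    } LocalChunkState;

    /*
     * Account for one tuple.  Shared memory is only touched when a new
     * chunk is started; the returned shared total (which covers whole
     * chunks, used or not) is what gets compared against the budget when
     * deciding whether to increase the number of batches.
     */
    static size_t
    account_for_tuple(SharedAccounting *shared, LocalChunkState *chunk,
                      size_t tupsize)
    {
        if (chunk->used == 0 || chunk->used + tupsize > CHUNK_SIZE)
        {
            /* Start a new chunk: one shared update per chunk, not per tuple. */
            chunk->used = tupsize;
            return atomic_fetch_add(&shared->total_chunk_bytes, CHUNK_SIZE)
                + CHUNK_SIZE;
        }
        chunk->used += tupsize;
        return atomic_load(&shared->total_chunk_bytes);
    }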

I'm torn between (1) posting a patch that teaches
ExecChooseHashTableSize() to estimate the worst case extra fragmentation,
assuming all participants contribute an almost entirely empty chunk at
the end, and (2) just finding some parameters (i.e. tweaking work_mem or
the number of tuples) that will make this stable on all computers in the
build farm.  I think the former is the correct solution.  Another
solution would be to teach the executor to discount the overhead, but
that seems hard and seems like it's travelling in the wrong direction.
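
For the record, option (1) would amount to something like the following
(a sketch of the adjustment only, not an actual patch; the function name,
signature and explicit participant count are mine):

    /*
     * Sketch of the option (1) adjustment: pad the tuple-based estimate
     * with a worst case of one nearly empty 32KB chunk per participant.
     */
    static double
    estimate_parallel_hash_inner_bytes(double ntuples, size_t tupsize,
                                       int nparticipants)
    {
        const size_t chunk_size = 32 * 1024;
        double      payload = ntuples * tupsize;

        /* Worst case: every participant wastes almost a whole final chunk. */
        return payload + (double) nparticipants * chunk_size;
    }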

> My guess is that what's happening is that one worker or the other ends
> up processing the whole scan, or the vast majority of it, so that that
> worker's hash table has to hold substantially more than half of the
> tuples and thereby is forced to up the number of batches. I don't see
> how you can expect to estimate that situation exactly; or if you do,
> you'll be pessimizing the plan for cases where the split is more nearly
> equal.

That sort of thing does indeed affect the size at execution time.  You
can see that run-to-run variation easily with a small join forced to use
Parallel Hash, so that there is a race to load tuples.  You get a larger
size if more workers manage to load at least one tuple, due to their
final partially filled chunks.
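
For a join small enough that each worker's tuples fit in its first chunk,
the accounted size is simply one 32KB chunk per worker that got at least
one tuple, as this trivial sketch (made-up name, illustrative only) shows:
three lucky workers cost 96KB where one worker loading everything costs
32KB.

    /*
     * Sketch of the run-to-run variation: when the whole inner side would
     * fit in a single chunk, the accounted size is one 32KB chunk per
     * participant that managed to load at least one tuple.
     */
    static size_t
    small_join_accounted_bytes(int participants_that_loaded_tuples)
    {
        const size_t chunk_size = 32 * 1024;

        return (size_t) participants_that_loaded_tuples * chunk_size;
    }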

There is also the question of this being underestimated on systems
without real atomics:

bucket_bytes = sizeof(HashJoinTuple) * nbuckets;

The real size at execution time is sizeof(dsa_pointer_atomic) *
nbuckets. I don't think that's responsible for this particular
underestimation problem because the bucket array is currently not
considered at execution time when deciding to increase batches -- it
should be, and I'll come back to those two problems separately.
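
To show the size difference in question, here is a sketch with a stand-in
type, since the real dsa_pointer_atomic definition depends on the
platform's atomics support; the struct below only mimics the emulated
case.

    /*
     * Sketch of the two bucket array sizes being compared.  HashJoinTuple
     * is a pointer in the planner's formula; without native atomics,
     * dsa_pointer_atomic is emulated with an embedded spinlock, so each
     * bucket is wider at execution time.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct HashJoinTupleData *HashJoinTuple;    /* pointer-sized */

    typedef struct
    {
        uint64_t    value;      /* the dsa_pointer itself */
        int         spinlock;   /* stand-in for the emulation lock */
    } emulated_dsa_pointer_atomic;

    int
    main(void)
    {
        size_t      nbuckets = 1048576;     /* illustrative bucket count */

        printf("planner bucket_bytes:   %zu\n",
               sizeof(HashJoinTuple) * nbuckets);
        printf("execution bucket bytes: %zu\n",
               sizeof(emulated_dsa_pointer_atomic) * nbuckets);
        return 0;
    }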

--
Thomas Munro
http://www.enterprisedb.com
