From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-05-16 22:21:56
Message-ID: CA+hUKG++SLmA5KRTBLHf9UFomjvsaQSMaMatV7KneWbLmT72YA@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I think this is a step in the right direction, but as I said on the other
> thread(s), I think we should not disable growth forever and recheck once
> in a while. Otherwise we'll end up in a sad situation with non-uniform data
> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
> this less strict logic, using 95% as a threshold (instead of 100%).
>
> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> if we decide adding batches would be pointless, increasing the memory
> budget is the only thing we can do anyway.

But that's not OK; we need to fix THAT.

> The problem however is that we only really look at a single bit - it may
> be that doubling the batches would not help, but doing it twice would
> actually reduce the memory usage. For example, assume there are 2 distinct
> values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore. But
if we had a decent fall-back strategy that respected work_mem, we
wouldn't care so much if we get it wrong in a corner case. I'm
arguing that we should use Grace partitioning as our primary
partitioning strategy, but fall back to looping (or possibly
sort-merging) for the current batch if Grace doesn't seem to be
working. You'll always be able to find cases where, if you'd just
tried one more round, Grace would have worked, but that seems
acceptable to me, because getting it wrong doesn't melt your
computer, it just probably takes longer. Or maybe it doesn't. How
much longer would it take to loop twice? Erm, twice as long, and each
loop makes actual progress, unlike extra speculative Grace partition
expansions, which apply not just to the current batch but to all
batches, might not actually work, and which you *have* to abandon at
some point anyway. The more I think about it, the more I think that a
loop-based escape valve, though unpalatably quadratic, is probably OK,
because we're in a sink-or-swim situation at this point and our budget
is work_mem, not work_time.
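
To spell out Tomas's corner case with made-up numbers (his concrete
hash values were trimmed from the quote above, so these are purely
illustrative, and I'm simplifying the batch number to
hash & (nbatch - 1), which isn't exactly what
ExecHashGetBucketAndBatch() does, but the principle is the same):

#include <stdio.h>

int
main(void)
{
	unsigned int hashes[] = {0x00000000, 0x00000004};	/* differ only in bit 2 */
	unsigned int nbatch;

	for (nbatch = 2; nbatch <= 16; nbatch *= 2)
		printf("nbatch = %2u: batchno(a) = %u, batchno(b) = %u\n",
			   nbatch,
			   hashes[0] & (nbatch - 1),
			   hashes[1] & (nbatch - 1));
	return 0;
}

Doubling from 2 to 4 batches doesn't separate the two values, but the
next doubling does.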

I'm concerned that we're trying to find ways to treat the symptoms,
allowing us to exceed work_mem but maybe not so much, instead of
focusing on the fundamental problem, which is that we don't yet have
an algorithm that is guaranteed to respect work_mem.
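
For concreteness, the looping fallback I'm describing amounts to
roughly this (a standalone sketch, with plain arrays standing in for
the batch files and a linear scan standing in for the per-chunk hash
table, nothing like patch-quality executor code):

#include <stdio.h>

#define CHUNK_SLOTS 4			/* stand-in for what fits in work_mem */

int
main(void)
{
	int		inner[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};	/* oversized inner batch */
	int		outer[] = {2, 4, 4, 6, 11, 9};	/* outer side of the same batch */
	int		ninner = 10;
	int		nouter = 6;
	int		start;

	for (start = 0; start < ninner; start += CHUNK_SLOTS)
	{
		/* "Build": hold only one work_mem-sized chunk of the inner batch. */
		int	   *chunk = inner + start;
		int		nchunk = (ninner - start < CHUNK_SLOTS) ? ninner - start : CHUNK_SLOTS;
		int		i, j;

		/* "Probe": rescan the entire outer batch once per inner chunk. */
		for (i = 0; i < nouter; i++)
			for (j = 0; j < nchunk; j++)
				if (outer[i] == chunk[j])
					printf("match on key %d (loop %d)\n",
						   outer[i], start / CHUNK_SLOTS);
	}
	return 0;
}

Memory stays bounded by the chunk size no matter how bad the skew is;
the price is an extra pass over the outer batch for every chunk, which
is the quadratic bit.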

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is that, although I know how
to do the non-parallel version using a BufFile full of match bits
kept in sync with the tuples for outer joins, I haven't figured out
how to do it for parallel-aware hash join, because then each loop
over the outer batch could see different tuples in each participant.
You could use the match bit in the HashJoinTuple header, but then
you'd have to write all the tuples out again, which is more IO than I
want to do. I'll probably start another thread about that.
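
For the record, the non-parallel bookkeeping I mean is just a bit per
outer tuple, kept in step with the outer batch file and OR'd into on
every loop; here's a toy version with an in-memory array standing in
for the BufFile of match bits (again, only a sketch):

#include <stdbool.h>
#include <stdio.h>

/* One probe pass over the outer batch for one in-memory inner chunk. */
static void
probe_loop(const int *chunk, int nchunk, const int *outer, int nouter,
		   bool *matched)
{
	int		i, j;

	for (i = 0; i < nouter; i++)
		for (j = 0; j < nchunk; j++)
			if (outer[i] == chunk[j])
				matched[i] = true;	/* OR the bit in; never cleared */
}

int
main(void)
{
	int		outer[] = {2, 4, 11, 9};
	bool	matched[4] = {false, false, false, false};
	int		chunk1[] = {1, 2, 3, 4};	/* first loop's inner chunk */
	int		chunk2[] = {5, 9, 10, 12};	/* second loop's inner chunk */
	int		i;

	probe_loop(chunk1, 4, outer, 4, matched);
	probe_loop(chunk2, 4, outer, 4, matched);

	/*
	 * Only after the final loop do we know which outer tuples to
	 * null-extend for a left outer join.
	 */
	for (i = 0; i < 4; i++)
		if (!matched[i])
			printf("outer key %d had no match -> emit with NULLs\n", outer[i]);
	return 0;
}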

--
Thomas Munro
https://enterprisedb.com
