Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-05-16 22:54:27
Message-ID: 20190516225427.uvqthiocuibkgh2s@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:
>On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I think this is a step in the right direction, but as I said on the other
>> thread(s), I think we should not disable growth forever and recheck once
>> in a while. Otherwise we'll end up in sad situation with non-uniform data
>> sets, as poined out by Hubert Zhang in [1]. It's probably even truer with
>> this less strict logic, using 95% as a threshold (instead of 100%).
>>
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.
>
>But that's not OK, we need to fix THAT.
>

I agree increasing the budget is not ideal, althought at the moment it's
the only thing we can do. If we can improve that, great.

>> The problem however is that we only really look at a single bit - it may
>> be that doubling the batches would not help, but doing it twice would
>> actually reduce the memory usage. For example, assume there are 2 distinct
>> values in the batch, with hash values (in binary)
>
>Yes, that's a good point, and not a case that we should ignore. But
>if we had a decent fall-back strategy that respected work_mem, we
>wouldn't care so much if we get it wrong in a corner case. I'm
>arguing that we should use Grace partitioning as our primary
>partitioning strategy, but fall back to looping (or possibly
>sort-merging) for the current batch if Grace doesn't seem to be
>working. You'll always be able to find cases where if you'd just
>tried one more round, Grace would work, but that seems acceptable to
>me, because getting it wrong doesn't melt your computer, it just
>probably takes longer. Or maybe it doesn't. How much longer would it
>take to loop twice? Erm, twice as long, and each loop makes actual
>progress, unlike extra speculative Grace partition expansions which
>apply not just to the current batch but all batches, might not
>actually work, and you *have* to abandon at some point. The more I
>think about it, the more I think that a loop-base escape valve, though
>unpalatably quadratic, is probably OK because we're in a sink-or-swim
>situation at this point, and our budget is work_mem, not work_time.
>

True.

>I'm concerned that we're trying to find ways to treat the symptoms,
>allowing us to exceed work_mem but maybe not so much, instead of
>focusing on the fundamental problem, which is that we don't yet have
>an algorithm that is guaranteed to respect work_mem.
>

Yes, that's a good point.

>Admittedly I don't have a patch, just a bunch of handwaving. One
>reason I haven't attempted to write it is because although I know how
>to do the non-parallel version using a BufFile full of match bits in
>sync with the tuples for outer joins, I haven't figured out how to do
>it for parallel-aware hash join, because then each loop over the outer
>batch could see different tuples in each participant. You could use
>the match bit in HashJoinTuple header, but then you'd have to write
>all the tuples out again, which is more IO than I want to do. I'll
>probably start another thread about that.
>

That pesky parallelism ;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-05-16 22:58:43 Re: Avoiding hash join batch explosions with extreme skew and weird stats
Previous Message Thomas Munro 2019-05-16 22:21:56 Re: Avoiding hash join batch explosions with extreme skew and weird stats