From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-05-17 00:26:23
Message-ID: CA+hUKGK1=Xa+AV3NR4PX0XNPwUbnVG=f-P3Abw17X1wE4Yevmg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 17, 2019 at 11:46 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:
> >Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> >> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
> >> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> >>> if we decide adding batches would be pointless, increasing the memory
> >>> budget is the only thing we can do anyway.
> >
> >> But that's not OK, we need to fix THAT.
> >
> >I don't think it's necessarily a good idea to suppose that we MUST
> >fit in work_mem come what may. It's likely impossible to guarantee
> >that in all cases. Even if we can, a query that runs for eons will
> >help nobody.
>
> I kinda agree with Thomas - arbitrarily increasing work_mem is something
> we should not do unless absolutely necessary. If the query is slow, it's
> up to the user to bump the value up, if deemed appropriate.

+1

I think we can guarantee that we fit in work_mem with only one
exception: we have to allow work_mem to be exceeded when we otherwise
couldn't fit a single tuple.
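
To illustrate that rule (a toy, self-contained sketch only; the
function and parameter names are made up and this is not the actual
executor code):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical admission check: a tuple may exceed the memory budget
 * only when the hash table is still empty, so that a single tuple
 * larger than work_mem can still be loaded.
 */
static bool
can_load_tuple(size_t space_used, size_t tuple_size,
               size_t space_allowed, int ntuples_loaded)
{
    if (space_used + tuple_size <= space_allowed)
        return true;            /* fits within the budget */
    return ntuples_loaded == 0; /* sole exception: one oversized tuple */
}

int
main(void)
{
    /* A 3kB tuple into a 4kB budget already holding 2kB: deferred. */
    printf("%d\n", (int) can_load_tuple(2048, 3072, 4096, 5));
    /* An 8kB tuple into an empty table with a 4kB budget: allowed. */
    printf("%d\n", (int) can_load_tuple(0, 8192, 4096, 0));
    return 0;
}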

Then the worst possible case with the looping algorithm is that we
degrade to loading just one inner tuple at a time into the hash table,
at which point we effectively have a nested loop join (except (1) it's
flipped around: for each tuple on the inner side, we scan the outer
side; and (2) we can handle full outer joins). In any reasonable case
you'll have a decent number of tuples at a time, so you won't have to
loop too many times, and it's not really quadratic in the number of
tuples. The realisation that it's a nested loop join in the extreme
case is probably why the MySQL people called it 'block nested loop
join' (and as far as I can tell from quick googling, it might be their
*primary* strategy for hash joins that don't fit in memory, not just a
secondary strategy after Grace fails, but I might be wrong about
that). Unlike plain old single-tuple nested loop join, it works in
arbitrarily sized blocks (the hash table). What we would call a regular
hash join, they call a BNL that just happens to have only one loop. I
think Grace is probably a better primary strategy, but loops are a
good fallback.
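
As a toy illustration of the looping idea (a self-contained sketch
only, with a plain array standing in for the hash table; nothing here
is real PostgreSQL or MySQL code):

#include <stdio.h>

#define BLOCK_SIZE 2            /* stand-in for the work_mem budget */

int
main(void)
{
    int inner[] = {1, 2, 3, 4, 5};      /* inner (build) side keys */
    int outer[] = {2, 4, 5, 7};         /* outer (probe) side keys */
    int ninner = 5, nouter = 4;

    /* Load the inner side one block at a time... */
    for (int start = 0; start < ninner; start += BLOCK_SIZE)
    {
        int end = start + BLOCK_SIZE < ninner ? start + BLOCK_SIZE : ninner;

        /*
         * ...and rescan the entire outer side against each block.
         * With BLOCK_SIZE == 1 this degenerates to a nested loop
         * join with the sides flipped, exactly as described above.
         */
        for (int o = 0; o < nouter; o++)
            for (int i = start; i < end; i++)
                if (outer[o] == inner[i])
                    printf("match: %d\n", inner[i]);
    }
    return 0;
}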

The reason I kept mentioning sort-merge in earlier threads is that
it'd be better in the worst cases. Unfortunately it would be worse in
the best case (smallish numbers of loops) and I suspect many real
world cases. It's hard to decide, so perhaps we should be happy that
sort-merge can't be considered currently because the join conditions
may not be merge-joinable.

--
Thomas Munro
https://enterprisedb.com
