|From:||Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>|
|To:||Thomas Munro <thomas(dot)munro(at)gmail(dot)com>|
|Subject:||Re: Avoiding hash join batch explosions with extreme skew and weird stats|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
On Thu, May 16, 2019 at 01:22:31PM +1200, Thomas Munro wrote:
>Here's a quick hack to show that a 95% cut-off fixes those examples.
>I don't really know how to choose the number, but I suspect it should
>be much closer to 100 than 50. I think this is the easiest of three
>fundamental problems that need to be solved in this area. The others
>are: accounting for per-partition overheads as Tomas pointed out, and
>providing an actual fallback strategy that respects work_mem when
>extreme skew is detected OR per-partition overheads dominate. I plan
>to experiment with nested loop hash join (or whatever you want to call
>it: the thing where you join every arbitrary fragment of the hash
>table against the outer batch, and somehow deal with outer match
>flags) when time permits.
I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever and recheck once
in a while. Otherwise we'll end up in sad situation with non-uniform data
sets, as poined out by Hubert Zhang in . It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).
I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.
The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)
and assume we currently. Clearly, splitting batches is going to do nothing
until we get to the 000 vs. 111 parts.
At first I thought this is rather unlikely and we can ignore that, but I'm
not really sure about that - it may actually be pretty likely. We may get
to 101010 bucket with sufficiently large data set, and then it's ~50%
probability the next bit is the same (assuming two distinct values). So
this may be quite an issue, I think.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
|Next Message||Andrey Borodin||2019-05-16 17:13:22||Re: pglz performance|
|Previous Message||Albert Cervera i Areny||2019-05-16 16:24:40||Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown|