From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-09-12 20:55:33
Message-ID: 54135DC5.2070301@fuzzy.cz
Lists: pgsql-hackers

On 12.9.2014 22:24, Robert Haas wrote:
> On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> On 12.9.2014 18:49, Robert Haas wrote:
>>> I'm comfortable with this version if you are, but (maybe as a
>>> follow-on commit) I think we could make this even a bit smarter. If
>>> inner_rel_bytes + bucket_bytes > hash_table_bytes but
>>> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
>>> the number of buckets by 2x or 4x to make it fit. That would provide
>>> a bit of additional safety margin against cases where this patch
>>> might unluckily lose.
>>
>> I don't think that's a good idea. That essentially says 'we're shooting
>> for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
>> batching (because of smaller work_mem) actually significantly improves
>> performance. If we want to do this reasoning properly, deciding
>> whether to add batches or reduce buckets, we need a better heuristic -
>> that's quite complex, and I'd expect it to result in a quite complex patch.
>>
>> So -1 from me to this at this moment (it certainly needs much more
>> thought than a mere follow-on commit).
>
> OK, but let's discuss further. You make it sound like treating
> NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
> I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
> if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
> or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I think this really depends on various factors. For example, what may
work for small work_mem values (fitting into the L2 cache) may not work
for large ones, etc.
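
For the record, here is a minimal, self-contained sketch of the heuristic
Robert describes above: shrink the bucket array by 2x or 4x when tuples
plus buckets would not fit into work_mem, before falling back to more
batches. This is not the actual ExecChooseHashTableSize() code; the input
numbers and most variable names are made up purely for illustration.

#include <stdio.h>
#include <stdint.h>

#define NTUP_PER_BUCKET 1

static uint32_t
next_pow2(uint32_t n)
{
	uint32_t	p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int
main(void)
{
	/* made-up example inputs, not taken from any real plan */
	double		ntuples = 500000;		/* estimated inner tuples */
	size_t		tuple_width = 64;		/* bytes per stored tuple */
	size_t		hash_table_bytes = 32UL * 1024 * 1024;	/* ~ work_mem = 32MB */

	size_t		inner_rel_bytes = (size_t) (ntuples * tuple_width);
	uint32_t	nbuckets = next_pow2((uint32_t) (ntuples / NTUP_PER_BUCKET));
	size_t		bucket_bytes = (size_t) nbuckets * sizeof(void *);
	int			shrink = 0;

	/*
	 * If tuples plus the bucket array overflow work_mem, first try halving
	 * the bucket array once or twice (NTUP_PER_BUCKET effectively becomes
	 * 2 or 4) before resorting to batching.
	 */
	while (inner_rel_bytes + bucket_bytes > hash_table_bytes && shrink < 2)
	{
		nbuckets /= 2;
		bucket_bytes /= 2;
		shrink++;
	}

	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
		printf("still does not fit, would have to increase nbatch\n");
	else
		printf("fits with nbuckets = %u (NTUP_PER_BUCKET ~ %d)\n",
			   (unsigned) nbuckets, 1 << shrink);

	return 0;
}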

> I'm actually quite surprised that you find batching to be a better
> strategy than skimping on buckets, because I would have expected the
> opposite, almost categorically. Batching means having to write out
> the tuples we can't process right away and read them back in. If
> that involves disk I/O, I think the cost of that I/O is going to be
> FAR more than the overhead we would have incurred by searching
> slightly longer bucket chains. If it didn't, then you could've set
> work_mem higher and avoided batching in the first place.

No, I don't find batching to be a better strategy. I just think this
really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4
to avoid batching" follow-up patch.

For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching,
and then discover we need to start batching anyway. Should we keep that
setting, or revert to NTUP_PER_BUCKET=1? Or maybe keep it for nbatch=2,
but revert for nbatch=16?
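
Just to make the open question concrete, one entirely hypothetical shape
of such a policy might look like the toy program below. The thresholds
are made up and none of this is PostgreSQL code; it only makes the
decision points explicit.

#include <stdio.h>

/* assumed policy: keep the relaxed value unless batching gets heavy */
static int
choose_ntup_per_bucket(int nbatch)
{
	if (nbatch <= 2)
		return 4;		/* no or mild batching: accept longer chains */
	return 1;			/* heavy batching (say nbatch=16): keep chains short */
}

int
main(void)
{
	int			nbatches[] = {1, 2, 16};
	int			i;

	for (i = 0; i < 3; i++)
		printf("nbatch = %2d -> NTUP_PER_BUCKET = %d\n",
			   nbatches[i], choose_ntup_per_bucket(nbatches[i]));
	return 0;
}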

I'm not saying it's wrong, but I think it's more complicated than it
seems from your description.

regards
Tomas
