Re: bad estimation together with large work_mem generates terrible slow hash joins

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-10-10 18:08:38
Message-ID: 543820A6.6030500@fuzzy.cz
Lists: pgsql-hackers

Hi!

On 9.10.2014 22:28, Kevin Grittner wrote:
> Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>
>> The only case I've been able to come up with is when the hash table
>> fits into work_mem only thanks to not counting the buckets. The new
>> code will start batching in this case.
>
> Hmm. If you look at the timings in my posting from 2014-10-01 I had
> cases where the patched version started with one batch and went to
> two, while the unpatched used just one batch, and the patched version
> was still more than twice as fast. I'm sure the "on disk" batch was
> fully cached, however; it might work out differently if disk speed
> actually came into the picture.

I think the main concern here is the case with a large outer table and
memory pressure, forcing the batches to actually be written to disk
(instead of just sitting in the page cache).

>> That is mostly luck, however, because it depends on the cardinality
>> estimates, and the best way to fix it is increasing work_mem (which is
>> safer thanks to reducing the overhead).
>>
>> Also, Robert proposed a way to mitigate this, if we realize we'd
>> have to do batching during the initial sizing, we can peek whether
>> reducing the number of buckets (to 1/2 or maybe 1/4) would help. I
>> believe this is a good approach, and will look into that after
>> pgconf.eu (i.e. early November), unless someone else is interested.
>
> Sure, but it would be good to confirm that it's actually needed first.

Yeah. But with a cleverly crafted test case it's possible to confirm
almost anything ;-)
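
Anyway, to make the idea a bit more concrete, what I have in mind for
ExecChooseHashTableSize() is roughly this (just a sketch with made-up
variable names, certainly not the final patch):

    /*
     * Rough sketch only -- nbuckets_optimal is a made-up name and the
     * surrounding code is simplified.  If the tuples alone fit into
     * work_mem and only the bucket array pushes us over the limit,
     * halve nbuckets (down to 1/4 of the optimal value) before falling
     * back to batching.
     */
    while (inner_rel_bytes + nbuckets * sizeof(HashJoinTuple) > hash_table_bytes &&
           nbuckets > nbuckets_optimal / 4)
        nbuckets /= 2;

    if (inner_rel_bytes + nbuckets * sizeof(HashJoinTuple) > hash_table_bytes)
    {
        /* still does not fit -> we have to batch anyway */
    }

Whether that's actually a win compared to simply batching is exactly the
thing that needs benchmarking, of course.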

>>> When given a generous work_mem setting the patched version often uses
>>> more of what it is allowed than the unpatched version (which is
>>> probably one of the reasons it tends to do better). If someone has
>>> set a high work_mem and has gotten by only because the configured
>>> amount is not all being used when it would benefit performance, they
>>> may need to adjust work_mem down to avoid memory problems. That
>>> doesn't seem to me to be a reason to reject the patch.
>>
>> I'm not entirely sure I understand this paragraph. What do you mean by
>> "configured amount is not all being used when it would benefit
>> performance"? Can you give an example?
>
> Again, the earlier post showed that while the unpatched used 3516kB
> whether it had work_mem set to 4MB or 1GB, the patched version used
> 3073kB when work_mem was set to 4MB and 4540kB when work_mem was
> set to 1GB. The extra memory allowed the patched version to stay
> at 1 batch, improving performance over the other setting.

OK, so with 4MB the new version was batching, while the original code
kept a single batch (likely because it does not count the buckets
against work_mem). I believe that's the expected behavior.
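
That also seems consistent with your numbers, assuming the two builds
build comparable hash tables: if the unpatched 3516 kB is roughly the
tuple data alone, the bucket array accounts for about
4540 - 3516 = 1024 kB, and 3516 + 1024 = 4540 kB > 4096 kB. So once the
buckets are counted, the single-batch table no longer fits into the 4 MB
work_mem and the patched code has to split it into two batches.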

Also, in the e-mail from Oct 2 that got lost [1], I pointed out that
because the tests use only small hash tables, the results are likely
influenced by a high CPU cache hit ratio. I think benchmarking with a
larger inner relation (thus increasing the memory occupied by the hash
table) might be interesting.

[1]
http://www.postgresql.org/message-id/c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu

>
>> The only thing I can think of is the batching behavior described above.
>>
>>> This is in "Waiting on Author" status only because I never got an
>>> answer about why the debug code used printf() rather the elog() at
>>> a DEBUG level. Other than that, I would say this patch is Ready
>>> for Committer. Tomas? You there?
>>
>> I think I responded to that on October 2, quoting:
...
> Ah, I never received that email. That tends to happen every now
> and then. :-(
>
>> I believe it's safe to switch the logging to elog(). IMHO the printf
>> logging is there from some very early version of the code, before elog
>> was introduced. Or something like that.
>
> I guess it might be best to remain consistent in this patch and
> change that in a separate patch. I just wanted to make sure you
> didn't see any reason not to do so.

OK, I'll put that on my TODO list for the next CF.
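
To be explicit about what I mean -- something along these lines, with
the exact messages and debug level up for discussion (just a sketch):

    /* e.g. in ExecHashTableCreate(), replacing the current printf() call */
    #ifdef HJDEBUG
        elog(DEBUG1, "nbatch = %d, nbuckets = %d", nbatch, nbuckets);
    #endif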

> With that addressed I will move this to Ready for Committer. Since
> both Heikki and Robert spent time on this patch earlier, I'll give
> either of them a shot at committing it if they want; otherwise I'll
> do it.

Great, thanks!
Tomas
