Re: bad estimation together with large work_mem generates terrible slow hash joins

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-09-10 18:25:32
Message-ID: 5410979C.6000509@vmware.com
Lists: pgsql-hackers

On 09/10/2014 01:49 AM, Tomas Vondra wrote:
> On 9.9.2014 16:09, Robert Haas wrote:
>> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>> So I only posted the separate patch for those who want to do a review,
>>> and then a "complete patch" with both parts combined. But it sure may be
>>> a bit confusing.
>>
>> Let's do this: post each new version of the patches only on this
>> thread, as a series of patches that can be applied one after another.
>
> OK, attached. Apply in this order:
>
> 1) dense-alloc-v5.patch
> 2) hashjoin-nbuckets-v12.patch

The dense-alloc-v5.patch looks good to me. I have committed that with
minor cleanup (more comments below). I have not looked at the second patch.

> I also did a few 'minor' changes to the dense allocation patch, most
> notably:
>
> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
> The original naming seemed a bit awkward.

That's too easy to confuse with regular MemoryContext and AllocChunk
stuff. I renamed it to HashMemoryChunk.

> * the chunk size is 32kB (instead of 16kB), and we're using a 1/4
> threshold for 'oversized' items
>
> We need the threshold to be >=8kB, to trigger the special case
> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

Should we care about the fact that if there are only a few tuples, we
will nevertheless waste 32kB of memory for the chunk? I guess not, but I
thought I'd mention it. The smallest allowed value for work_mem is 64kB.
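
For anyone reviewing without the patch at hand, the scheme boils down to
roughly the sketch below. This is a simplified standalone illustration,
not the committed code: Chunk, CHUNK_SIZE, CHUNK_THRESHOLD and
dense_alloc() stand in for the HashMemoryChunk machinery in nodeHash.c,
the rounding is a crude stand-in for MAXALIGN, and allocating from the
hash table's batch memory context is replaced by plain malloc():

#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE       (32 * 1024)        /* 32kB chunks, as in the committed patch */
#define CHUNK_THRESHOLD  (CHUNK_SIZE / 4)   /* 1/4 rule: bigger requests get their own chunk */

typedef struct Chunk
{
    struct Chunk *next;     /* chunks form a simple linked list */
    size_t        maxlen;   /* usable bytes in data[] */
    size_t        used;     /* bytes already handed out */
    char          data[];   /* dense storage for the tuples */
} Chunk;

/* Carve 'size' bytes out of the head chunk, starting a new chunk when needed. */
static void *
dense_alloc(Chunk **chunks, size_t size)
{
    Chunk *chunk;

    size = (size + 7) & ~(size_t) 7;        /* crude stand-in for MAXALIGN */

    /* Oversized requests get a dedicated, exactly-sized chunk ... */
    if (size > CHUNK_THRESHOLD)
    {
        chunk = malloc(offsetof(Chunk, data) + size);
        chunk->maxlen = size;
        chunk->used = size;
        /* ... linked in behind the head, so the current chunk keeps filling up */
        if (*chunks != NULL)
        {
            chunk->next = (*chunks)->next;
            (*chunks)->next = chunk;
        }
        else
        {
            chunk->next = NULL;
            *chunks = chunk;
        }
        return chunk->data;
    }

    /* Start a fresh 32kB chunk if there is none, or the head can't fit the request. */
    if (*chunks == NULL || (*chunks)->maxlen - (*chunks)->used < size)
    {
        chunk = malloc(offsetof(Chunk, data) + CHUNK_SIZE);
        chunk->maxlen = CHUNK_SIZE;
        chunk->used = 0;
        chunk->next = *chunks;
        *chunks = chunk;
    }

    chunk = *chunks;
    chunk->used += size;
    return chunk->data + chunk->used - size;
}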

> I also considered Heikki's suggestion that while rebatching, we can
> reuse chunks with a single large tuple, instead of copying it
> needlessly. My attempt, however, made the code much uglier (I almost
> used a GOTO, but my hands refused to type it ...). I'll look into that.

I tried constructing a test case where the rehashing time would be
significant enough for this to matter, but I couldn't find one. Even if
the original estimate of the number of batches is way too small, we ramp
up quickly to a reasonable size and the rehashing time is completely
insignificant compared to all the other work. So I think we can drop
that idea.
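
To make the trade-off concrete: rebatching walks the old chunk list,
copies every surviving tuple into fresh chunks, and frees the old chunks
wholesale, which is why relinking a chunk that holds only one oversized
tuple looked attractive. A rough sketch, building on the Chunk and
dense_alloc() sketch above; the hypothetical StoredTuple header stands in
for the HashJoinTuple header, and the modulo is a simplistic stand-in for
ExecHashGetBucketAndBatch():

#include <string.h>

typedef struct StoredTuple
{
    unsigned int hashvalue;   /* join key hash; decides which batch the tuple belongs to */
    size_t       len;         /* total size of this entry, header included */
    /* tuple payload follows immediately */
} StoredTuple;

/*
 * After doubling nbatch, walk the old chunk list and copy every tuple that
 * still belongs to curbatch into fresh chunks; the rest would be spooled to
 * the new batch's temp file (omitted here).  Note that a chunk holding a
 * single oversized tuple is freed and its tuple copied like any other, which
 * is exactly the copy the reuse idea would avoid.
 */
static void
rebatch(Chunk **chunks, unsigned int nbatch, unsigned int curbatch)
{
    Chunk *oldchunks = *chunks;

    *chunks = NULL;                       /* start over with an empty chunk list */

    while (oldchunks != NULL)
    {
        Chunk  *nextchunk = oldchunks->next;
        size_t  pos = 0;

        while (pos < oldchunks->used)
        {
            StoredTuple *tup = (StoredTuple *) (oldchunks->data + pos);

            if (tup->hashvalue % nbatch == curbatch)    /* simplistic batch choice */
                memcpy(dense_alloc(chunks, tup->len), tup, tup->len);
            /* else: write the tuple to the new batch's temp file (not shown) */

            pos += (tup->len + 7) & ~(size_t) 7;        /* same alignment as dense_alloc */
        }

        free(oldchunks);                  /* the old chunk is released wholesale */
        oldchunks = nextchunk;
    }
}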

- Heikki
