Re: bad estimation together with large work_mem generates terrible slow hash joins

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-09-08 21:53:40
Message-ID: 540E2564.6030106@fuzzy.cz
Lists: pgsql-hackers

On 8.9.2014 22:44, Robert Haas wrote:
> On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> as Heikki mentioned in his "commitfest status" message, this patch
>> still has no reviewers :-( Is there anyone willing to pick up that,
>> together with the 'dense allocation' patch (as those two are
>> closely related)?
>>
>> I'm looking in Robert's direction, as he's the one who came up with
> the dense allocation idea, and also commented on the hashjoin bucket
>> resizing patch ...
>>
>> I'd ask Pavel Stehule, but as he's sitting next to me in the office
>> he's not really independent ;-) I'll do some reviews on the other
>> patches over the weekend, to balance this of course.
>
> Will any of those patches be, heh heh, mine?

I'll exchange some of the credit for a +1. Deal? ;-)

> I am a bit confused by the relationship between the two patches you
> posted. The "combined" patch applies, but the other one does not. If
> this is a sequence of two patches, it seems like it would be more
> helpful to post A and B rather than B and A+B. Perhaps the other
> patch is on a different thread but there's not an obvious reference
> to said thread here.

Yeah, it's probably a bit confusing. The thing is, the "dense allocation"
idea was discussed in a different thread, so I posted the patch there:

http://www.postgresql.org/message-id/53CBEB8A.207@fuzzy.cz

The patch tweaking the hash join buckets builds on the dense allocation,
because the dense allocation (a) makes up for the memory used by the
additional buckets, and (b) makes it much easier to rebuild the buckets,
roughly like the sketch below.
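
With the dense allocation in place, the rebuild is just a sequential walk
through the chunks, re-inserting each tuple into the new bucket array. A
rough sketch (field names more or less as in the patch, so take them with
a grain of salt):

    HashChunk   chunk;

    for (chunk = hashtable->chunks_batch; chunk != NULL; chunk = chunk->next)
    {
        size_t  idx = 0;

        /* process all tuples stored in this chunk */
        while (idx < chunk->used)
        {
            HashJoinTuple tuple = (HashJoinTuple) (chunk->data + idx);

            /* nbuckets is a power of two, so masking works as modulo */
            int     bucketno = tuple->hashvalue & (hashtable->nbuckets - 1);

            /* insert at the head of the new bucket */
            tuple->next = hashtable->buckets[bucketno];
            hashtable->buckets[bucketno] = tuple;

            /* advance past this tuple to the next one in the chunk */
            idx += MAXALIGN(HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len);
        }
    }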

So I only posted the separate patch for those who want to review it, and
then a "complete patch" with both parts combined. But I agree it may be a
bit confusing.

> In ExecChooseHashTableSize(), if buckets_bytes is independent of
> nbatch, could we just compute it before working out nbatch, and just
> deduct it from hash_table_bytes? That seems like it'd do the same
> thing for less work.

Well, we can do that. But I really don't think that's going to make a
measurable difference, and I think the code is clearer this way.
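
Just to be sure we mean the same thing, I guess it would look something
like this (only a sketch, not tested):

    /* size the bucket array first ... */
    buckets_bytes = sizeof(HashJoinTuple) *
                    (1 << my_log2(ntuples / NTUP_PER_BUCKET));

    /* ... reserve the space for it up front ... */
    hash_table_bytes -= buckets_bytes;

    /* ... and only then work out nbatch from what remains */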

> Either way, what happens if buckets_bytes is already bigger than
> hash_table_bytes?

Hm, I don't see how that could happen.

FWIW, I think the first buckets_bytes formula is actually wrong:

buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);

and should be

buckets_bytes = sizeof(HashJoinTuple) *
                (1 << my_log2(ntuples / NTUP_PER_BUCKET));

instead. Also, we might take into account that we never use fewer than
1024 buckets (but that's a minor issue, I guess).
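
For the record, here is a tiny standalone program illustrating the
difference (my_log2 copied from dynahash.c, assuming 8-byte pointers):

    #include <stdio.h>

    /* rounds up to the exponent of the next power of two */
    static int
    my_log2(long num)
    {
        int     i;
        long    limit;

        for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
            ;
        return i;
    }

    int
    main(void)
    {
        long    nbuckets = 1000;    /* say, ntuples / NTUP_PER_BUCKET */

        /* wrong - multiplies by the exponent (10), giving 80 bytes */
        printf("%ld\n", 8 * (long) my_log2(nbuckets));

        /* right - multiplies by the bucket count (1024), giving 8192 bytes */
        printf("%ld\n", 8 * (1L << my_log2(nbuckets)));

        return 0;
    }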

But once we get into the "need batching" branch, we do this:

lbuckets = 1 << my_log2(hash_table_bytes /
                        (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)));

which includes both the bucket (pointer) and tuple size, and IMHO
guarantees that bucket_bytes will never be over work_mem (which is what
hash_table_bytes is).

The only case where this might happen is (tupsize < 8), thanks to my_log2
rounding up to the next power of two: the exact quotient would put the
buckets at (50% of work_mem + epsilon), and the rounding then doubles that
to >100% of work_mem.
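
To spell out the arithmetic (assuming sizeof(HashJoinTuple) = 8, i.e. a
pointer on a 64-bit machine), with q = hash_table_bytes / (tupsize *
NTUP_PER_BUCKET + 8) we get

    lbuckets     = 1 << my_log2(q)  <  2 * q
    bucket_bytes = 8 * lbuckets     <  16 * hash_table_bytes
                                       / (tupsize * NTUP_PER_BUCKET + 8)

which can exceed hash_table_bytes only when (tupsize * NTUP_PER_BUCKET + 8)
< 16, i.e. with NTUP_PER_BUCKET = 1 as in the patch, only when tupsize < 8.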

But tupsize is defined as this:

tupsize = HJTUPLE_OVERHEAD +
          MAXALIGN(sizeof(MinimalTupleData)) +
          MAXALIGN(tupwidth);

and HJTUPLE_OVERHEAD alone is 12B, sizeof(MinimalTupleData) is >11B (even
ignoring the alignment), so tupsize can never get anywhere near 8.

So I believe this is safe ...

> A few more stylistic issues that I see:
>
> + if ((hashtable->nbatch == 1) &&
> + if (hashtable->nbuckets_optimal <= (INT_MAX/2))
> + if (size > (HASH_CHUNK_SIZE/8))
>
> While I'm all in favor of using parentheses generously where it's
> needed to add clarity, these and similar usages seem over the top to
> me.

It seemed more readable to me at the time of writing it, and it still
seems better this way, but ...

https://www.youtube.com/watch?v=CxK_nA2iVXw

> On a related note, the first of these reads like this if (stuff) { if
> (more stuff) { do things } } which makes one wonder why we need two if
> statements.

We probably don't ...
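
AFAICS the two conditions can simply be merged, something like this
(hypothetical, I'd have to check the exact conditions in the patch):

    if (hashtable->nbatch == 1 &&
        hashtable->nbuckets_optimal <= INT_MAX / 2)
    {
        /* ... resize the bucket array ... */
    }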

>
> +
> + /* used for dense allocation of tuples (into linked chunks) */
> + HashChunk chunks_batch; /* one list for the whole batch */
> +
> } HashJoinTableData;
>
> If the original coding didn't have a blank line between the last
> structure member and the closing brace, it's probably not right to
> make it that way now. There are similar issues at the end of some
> functions you modified, and a few other places (like the new code in
> ExecChooseHashTableSize and chunk_alloc) where there are extra blank
> lines at the starts and ends of blocks.

I'll fix that. FWIW, those issues seem to be in the 'dense allocation'
patch.

>
> +char * chunk_alloc(HashJoinTable hashtable, int size) {
>
> Eh... no.
>
> + /* XXX maybe we should use MAXALIGN(size) here ... */
>
> +1.

I'm not sure what the 'no' is pointing at? Maybe that the function should
be formatted in the usual PostgreSQL style, i.e. something like this:
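
    char *
    chunk_alloc(HashJoinTable hashtable, int size)
    {
        ...
    }

And is the +1 about doing MAXALIGN(size)?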

regards
Tomas
