Re: bad estimation together with large work_mem generates terrible slow hash joins

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-09-11 16:12:33
Message-ID: 2e5de7af7b6f77f862824d483db2bf61.squirrel@sq.gransy.com
Lists: pgsql-hackers

On 11 September 2014, 17:28, Tom Lane wrote:
> "Tomas Vondra" <tv(at)fuzzy(dot)cz> writes:
>> On 11 September 2014, 16:11, Tom Lane wrote:
>>> Ah. Well, that would mean that we need a heuristic for deciding when
>>> to increase the number of buckets versus the number of batches ...
>>> seems like a difficult decision.
>
>> That's true, but that's not the aim of this patch. The patch simply
>> increases the number of buckets if the load happens to get too high, and
>> does not try to decide between increasing nbuckets and nbatch.
>
> On further thought, increasing nbuckets without changing the batch
> boundaries would not get us out of an out-of-work_mem situation, in fact
> it makes memory consumption worse not better (assuming you count the
> bucket headers towards work_mem ;-)).

Yes, the patch counts the bucket headers towards work_mem, while the
original code didn't. The reasoning is that due to changing
NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not
negligible and should be accounted for.
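
To illustrate what that accounting means, here is a trivial sketch
(hypothetical names, not the patch's actual code):

#include <stddef.h>

/*
 * Rough illustration: the bucket array is nbuckets pointers, and with
 * NTUP_PER_BUCKET = 1 that is roughly one pointer per tuple, so it is
 * added to the tuple space before comparing against work_mem.
 */
static size_t
hash_space_used(size_t tuple_space, size_t nbuckets)
{
    size_t  bucket_space = nbuckets * sizeof(void *);   /* bucket headers */

    return tuple_space + bucket_space;  /* compare this against work_mem */
}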

> So in principle, the rule seems like it ought to go "if load (defined as
> max bucket chain length, I imagine?) gets too high, but we are still
> well below work_mem, increase nbuckets; else increase nbatch". And
> perhaps we reset nbuckets again for the next batch, not sure. If we
> are dealing with an out-of-work_mem situation then only increasing nbatch
> would be a suitable response.

Almost.

If we expect batching right from the start, we size nbuckets for a "full
work_mem" (i.e. we see how many tuples fit into work_mem without exceeding
the NTUP_PER_BUCKET threshold).
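
For illustration, the sizing could look roughly like this; the names and
the simplified per-tuple cost are assumptions, not the executor's actual
sizing code:

#include <stddef.h>

#define NTUP_PER_BUCKET 1       /* the new target load factor */

/*
 * Sketch only: assume each batch will fill work_mem, and pick a
 * power-of-two bucket count that keeps the load at or below
 * NTUP_PER_BUCKET for that many tuples.
 */
static size_t
nbuckets_for_full_work_mem(size_t work_mem_bytes, size_t tuple_width)
{
    /* each stored tuple costs its width plus one bucket header */
    size_t  per_tuple = tuple_width + sizeof(void *);
    size_t  max_tuples = work_mem_bytes / per_tuple;
    size_t  nbuckets = 1;

    /* round up to a power of two, as the bucket masking requires */
    while (nbuckets * NTUP_PER_BUCKET < max_tuples)
        nbuckets *= 2;

    return nbuckets;
}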

If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (counting the memory the
optimal number of buckets would take), we keep that optimal value.

Only at the end do we check whether (nbuckets != nbuckets_optimal) and
resize the hash table if needed. We also keep this value for all batches
(that's OK because it assumes a full work_mem, and it keeps the batchno
evaluation trivial). So the resize happens at most once, and only for the
first batch.
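
A rough sketch of that flow (illustrative struct and function names, not
the patch itself):

#include <stdint.h>

#define NTUP_PER_BUCKET 1

/*
 * While loading the first batch we only bump nbuckets_optimal; the bucket
 * array itself is rebuilt at most once, after the batch is fully loaded.
 * Since nbuckets then stays the same for every batch, the split of the
 * hash bits into bucketno/batchno never shifts.
 */
typedef struct HashSketch
{
    uint32_t nbuckets;          /* current bucket count, a power of two */
    uint32_t log2_nbuckets;     /* log2(nbuckets) */
    uint32_t nbuckets_optimal;  /* bucket count we would like to have */
    uint32_t log2_nbuckets_optimal;
    uint32_t nbatch;            /* number of batches, a power of two */
    double   ntuples;           /* tuples inserted so far */
} HashSketch;

static void
tuple_inserted(HashSketch *hs)
{
    hs->ntuples += 1;

    /* grow the "optimal" count whenever the load factor gets too high */
    while (hs->ntuples > (double) hs->nbuckets_optimal * NTUP_PER_BUCKET)
    {
        hs->nbuckets_optimal *= 2;
        hs->log2_nbuckets_optimal += 1;
    }
}

static void
first_batch_loaded(HashSketch *hs)
{
    if (hs->nbuckets != hs->nbuckets_optimal)
    {
        /* reallocate the bucket array and re-link the tuples (not shown) */
        hs->nbuckets = hs->nbuckets_optimal;
        hs->log2_nbuckets = hs->log2_nbuckets_optimal;
    }
}

static void
bucket_and_batch(const HashSketch *hs, uint32_t hashvalue,
                 uint32_t *bucketno, uint32_t *batchno)
{
    /* low bits pick the bucket, the bits above them pick the batch */
    *bucketno = hashvalue & (hs->nbuckets - 1);
    *batchno = (hashvalue >> hs->log2_nbuckets) & (hs->nbatch - 1);
}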

> Because of the risk that increasing nbuckets would itself lead to a
> work_mem violation, I don't think it's sane to ignore the interaction
> entirely, even in a first patch.

The possibility that adding buckets increases the number of batches is a
consequence of counting the buckets towards work_mem. I believe that's the
right thing to do here, as it makes the work_mem guarantee stronger, but
it's not something this patch depends on. If this is the only concern, we
can leave that part out.

It's also worth mentioning that the dense allocation of tuples makes a
huge difference. palloc easily results in 100% overhead for narrow tuples
(i.e. it allocates twice the memory actually needed). For example, [1]
shows a hash join consuming 1.4GB with work_mem=800MB.

The dense allocation patch pretty much makes this overhead go away (as
demonstrated in [1]). For wider tuples the difference is smaller, but
that's also where the bucket headers matter less.
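
For reference, the dense allocation idea is roughly the following
standalone sketch; the chunk size and names are assumptions, and the real
thing also has to handle oversized tuples and release whole chunks when a
batch is split:

#include <stdlib.h>

#define CHUNK_SIZE (32 * 1024)      /* assumed chunk size */

/*
 * Instead of one palloc per tuple, each carrying allocator overhead and a
 * power-of-two round-up, tuples are carved out of large chunks, so the
 * per-tuple overhead is at most a few bytes of alignment padding.
 * Tuples larger than CHUNK_SIZE and error handling are ignored here.
 */
typedef struct Chunk
{
    struct Chunk *next;         /* chunks are kept in a simple list */
    size_t  used;               /* bytes used so far in data[] */
    char    data[];             /* CHUNK_SIZE bytes of tuple storage */
} Chunk;

static Chunk *chunks = NULL;

static void *
dense_alloc(size_t size)
{
    void   *ptr;

    size = (size + 7) & ~(size_t) 7;        /* keep 8-byte alignment */

    if (chunks == NULL || chunks->used + size > CHUNK_SIZE)
    {
        Chunk  *c = malloc(sizeof(Chunk) + CHUNK_SIZE);

        c->next = chunks;
        c->used = 0;
        chunks = c;
    }

    ptr = chunks->data + chunks->used;
    chunks->used += size;
    return ptr;
}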

What I'm trying to say is that it's really a matter of work_mem values.
Currently, because of the weak guarantee, people tend to set work_mem
lower than necessary. The dense allocation makes that caution unnecessary,
allowing higher work_mem values and thus eliminating the issue.

If this is not enough, we can stop counting the buckets towards work_mem.
It'll still be more memory efficient than the old approach (as a pointer
is smaller than palloc overhead), and it won't cause additional batches.
But I don't like this - I think we should make work_mem a stronger
guarantee.

[1] http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz

regards
Tomas
