From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-09-11 15:21:53
Message-ID: CA+TgmobBn0KgsFHRvQWY91NtBNvWAxdyKQL7tjon8jh=nk3VHQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>>> (3) It allows the number of batches to increase on the fly while the
>>>> hash join is in process.
>
>>> Pardon me for not having read the patch yet, but what part of (3)
>>> wasn't there already?
>
>> EINSUFFICIENTCAFFEINE.
>
>> It allows the number of BUCKETS to increase, not the number of
>> batches. As you say, the number of batches could already increase.
>
> Ah. Well, that would mean that we need a heuristic for deciding when to
> increase the number of buckets versus the number of batches ... seems
> like a difficult decision.

Well, the patch takes the approach of increasing the number of buckets
when (1) there's only one batch and (2) the load factor exceeds
NTUP_PER_BUCKET. As Tomas points out, it's just increasing the number
of buckets to the exact same number which we would have allocated had
we known that this many tuples would show up.
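
To make the mechanism concrete, here's a self-contained toy model of
that behavior. This is not the patch itself: apart from NTUP_PER_BUCKET
every name is invented, and the real code also has to worry about
batches, work_mem accounting, and the dense-packed chunk storage.

#include <stdio.h>
#include <stdlib.h>

#define NTUP_PER_BUCKET 1		/* target load factor */

typedef struct Tuple
{
	unsigned int hash;			/* stand-in for the real hash value */
	struct Tuple *next;			/* bucket chain link */
} Tuple;

typedef struct
{
	int			nbuckets;
	long		ntuples;
	Tuple	  **buckets;
} HashTable;

static void
grow_buckets(HashTable *ht)
{
	int			new_nbuckets = ht->nbuckets;

	/* Pick the bucket count we'd have chosen with a correct estimate. */
	while (ht->ntuples > (long) new_nbuckets * NTUP_PER_BUCKET)
		new_nbuckets *= 2;

	Tuple	  **new_buckets = calloc(new_nbuckets, sizeof(Tuple *));

	/* Re-link every already-loaded tuple into the larger bucket array. */
	for (int i = 0; i < ht->nbuckets; i++)
	{
		Tuple	   *tup = ht->buckets[i];

		while (tup != NULL)
		{
			Tuple	   *next = tup->next;
			int			b = tup->hash % new_nbuckets;

			tup->next = new_buckets[b];
			new_buckets[b] = tup;
			tup = next;
		}
	}

	free(ht->buckets);
	ht->buckets = new_buckets;
	ht->nbuckets = new_nbuckets;
}

static void
insert_tuple(HashTable *ht, unsigned int hash)
{
	Tuple	   *tup = malloc(sizeof(Tuple));

	tup->hash = hash;
	tup->next = ht->buckets[hash % ht->nbuckets];
	ht->buckets[hash % ht->nbuckets] = tup;
	ht->ntuples++;

	/* The trigger: still one (implicit) batch, load factor too high. */
	if (ht->ntuples > (long) ht->nbuckets * NTUP_PER_BUCKET)
		grow_buckets(ht);
}

int
main(void)
{
	/* Pretend the planner expected ~1k tuples but 100k show up. */
	HashTable	ht = {1024, 0, calloc(1024, sizeof(Tuple *))};

	for (unsigned int i = 0; i < 100000; i++)
		insert_tuple(&ht, i * 2654435761u);

	printf("tuples = %ld, buckets = %d\n", ht.ntuples, ht.nbuckets);
	return 0;
}

Because the bucket count at least doubles each time, the re-linking
cost is amortized across the inserts that triggered it.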

Now, there is a possibility that we could lose. Let's suppose that
the tuples overflow work_mem, but just barely. If we've accurately
estimated how many tuples there are, the patch changes nothing: either
way we're going to need two batches. But let's say we've
underestimated the number of tuples by 3x. Then it could be that
without the patch we'd allocate fewer buckets, never increase the
number of buckets, and avoid batching; whereas with the patch, we'll
discover that our tuple estimate was wrong, increase the number of
buckets on the fly, and then be forced into two batches by the
slightly increased memory consumption of the extra bucket headers.
That would suck.
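
To put rough, made-up numbers on that scenario (assume 8-byte bucket
pointers, NTUP_PER_BUCKET = 1, work_mem = 64MB, ~16 bytes per packed
tuple, and an estimate of 1M tuples when 3M actually arrive):

	tuples:            3M * 16 bytes (dense-packed) = 48 MB
	buckets, initial:  1M *  8 bytes (per estimate) =  8 MB  -> 56 MB, one batch
	buckets, resized:  4M *  8 bytes (per reality)  = 32 MB  -> 80 MB, two batches

The tuples themselves fit either way; it's the grown bucket array that
pushes us over the line.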

But catering to that case is basically hoping that we'll fall into a
septic tank and come out smelling like a rose - that is, we're hoping
that our initial poor estimate will cause us to accidentally do the
right thing later. I don't think that's the way to go. It's much
more important to get right the case where things are bigger than we
expected but still fit within work_mem; and we're currently taking a
huge run-time penalty in that case, as Tomas' results show.
If we wanted to cater to the other case in the future, we could
consider the opposite approach: if work_mem is exhausted, throw the
bucket headers away and keep reading tuples into the dense-packed
chunks added by yesterday's commit. If we again run out of work_mem,
then we *definitely* need to increase the batch count. If we manage
to make everything fit, then we know exactly how many bucket headers
we can afford, and can use some heuristic to decide between that and
using more batches.
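
In very rough pseudo-C, that fallback might look like the sketch
below. The lowercase helper names are invented purely for
illustration; only spaceUsed/spaceAllowed and
ExecHashIncreaseNumBatches correspond to things that exist today.

/* Pseudo-code sketch of the fallback described above. */
if (hashtable->spaceUsed > hashtable->spaceAllowed)
{
	/* Give back the bucket-header array and keep loading bare tuples
	 * into the dense-packed chunks. */
	discard_bucket_headers(hashtable);

	if (!load_remaining_tuples_into_chunks(hashtable))
	{
		/* Even the bare tuples don't fit: batching is unavoidable. */
		ExecHashIncreaseNumBatches(hashtable);
	}
	else
	{
		/*
		 * Everything fit, so we now know exactly how many bucket
		 * headers we can afford; decide between rebuilding a (possibly
		 * smaller) bucket array and adding another batch.
		 */
		choose_buckets_or_batches(hashtable);
	}
}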

I don't think that should be a requirement for this patch, though: I
think the cumulative effect of Tomas's patches is going to be a win
*much* more often than it's a loss. In 100% of cases, the
dense-packing stuff will make a hash table containing the same tuples
significantly smaller. Except when we've radically overestimated the
tuple count, the change to NTUP_PER_BUCKET will mean less time spent
chasing hash chains. It does use more memory, but that's more than
paid for by the first change. Both the dense-packing stuff and the
changes to include bucket headers in the work_mem calculation will
have the effect of making the work_mem bound on memory utilization far
more accurate than it's ever been previously. And the change to
increase the number of buckets at runtime will mean that we're
significantly more resilient against the planner underestimating the
number of tuples. That's clearly good. The fact that we can
construct borderline cases where it loses shouldn't deter us from
pushing forward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
