Re: Avoiding OOM in a hash join with many duplicate inner keys

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, David Hinkle <hinkle(at)cipafilter(dot)com>
Subject: Re: Avoiding OOM in a hash join with many duplicate inner keys
Date: 2017-02-16 20:56:49
Message-ID: CA+TgmoZCFcNs+BuTziv+zY=MAZ2yqtR=0Kq5305s4QaRxCOqiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does. I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join? That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
>
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size. But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in. There's generally a whole lot of noise in the tail-end
> MCV numbers.

Oh, right. That's reassuring, as it seems like it has a much better
chance of actually being right.

> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats. That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats. With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats. Given how long it took us
> to recognize this problem, that's probably the wrong default.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Christensen 2017-02-16 20:58:57 [PATCH] Add pg_disable_checksums() and supporting infrastructure
Previous Message Tom Lane 2017-02-16 20:51:20 Re: Avoiding OOM in a hash join with many duplicate inner keys