Re: Avoiding OOM in a hash join with many duplicate inner keys

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, David Hinkle <hinkle(at)cipafilter(dot)com>
Subject: Re: Avoiding OOM in a hash join with many duplicate inner keys
Date: 2017-02-19 13:42:38
Message-ID: CA+TgmobT9BRG+rp8VmWdHzV_U=BVr24GeinfwNaqDE4+Mms83g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 16, 2017 at 8:13 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Obviously there are vanishing returns here as we add more defences
> making it increasingly unlikely that we hit "fail" mode. But it
> bothers me that hash joins in general are not 100% guaranteed to be
> able to complete unless you have infinite RAM.

I think in practice most people are forced to set work_mem to such a
small percentage of their available RAM that actual RAM exhaustion is
quite rare. The default value of 4MB is probably conservative even
for a Raspberry Pi, at least until the connection count spikes
unexpectedly, or until you have this problem:

https://www.postgresql.org/message-id/16161.1324414006@sss.pgh.pa.us

Most advice that I've seen for work_mem involves choosing values like
RAM / (4 * max_connections), which tends to result in much larger
values that are typically still very small compared to the amount of
RAM that's available at any given moment, because most of the time
you either don't have the maximum number of connections, or some of
them are idle, or not all of them are using plans that need any
work_mem. Unfortunately, even with these very conservative settings,
one sometimes sees a machine get absolutely swamped by a large
activity spike at a time when all of the backends just so happen to be
running a query that uses 2 or 3 (or 180) copies of work_mem.[1]
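
As a rough illustration of that arithmetic, here is a
back-of-the-envelope sketch in Python; the machine size, connection
count, and nodes-per-query figures are made up for the example, but
the point is that work_mem is a per-sort/per-hash allowance rather
than a per-backend cap:

# Hypothetical numbers, purely illustrative of the heuristic above.
ram_mb = 64 * 1024              # assumed machine: 64GB of RAM
max_connections = 400

# The common rule of thumb: RAM / (4 * max_connections)
work_mem_mb = ram_mb / (4 * max_connections)
print(f"suggested work_mem: {work_mem_mb:.0f}MB")        # ~41MB

# work_mem is a per-sort/per-hash allowance, not a per-backend cap, so an
# activity spike where every backend runs a plan with several such nodes
# can blow far past the RAM/4 share the heuristic implicitly assumes.
nodes_per_query = 3                  # e.g. two sorts plus a hash join
active_backends = max_connections    # worst case: every connection busy
peak_mb = work_mem_mb * nodes_per_query * active_backends
print(f"theoretical peak: {peak_mb / 1024:.0f}GB of {ram_mb / 1024:.0f}GB RAM")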

If I were going to try to do something about the problem of machines
running out of memory, I'd be inclined to look at the problem more
broadly than "hey, hash joins can exceed work_mem if certain bad
things happen" and instead think about "hey, work_mem is a stupid way
of deciding on a memory budget". The intrinsic stupidity of work_mem
as an allocation system means that (1) it's perfectly possible to run
out of memory even if every node respects the memory budget and (2)
it's perfectly possible to drastically underutilize the memory you do
have even if some nodes fail to respect the memory budget. Of course,
if we had a smarter system for deciding on the budget, it would be
more, not less, important for nodes to respect the budget they were
given, so that's not really an argument against trying to plug the
hole you're complaining about here, just a doubt about how much it
will improve the user experience if that's the only thing you do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Or all of the connections just touch each of your 100,000
relations and the backend-local caches fill up and the whole system
falls over without using any work_mem at all.
