Re: OOM with many sorts

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: OOM with many sorts
Date: 2019-07-09 04:43:21
Message-ID: CA+hUKGJ2yNYpH5X_AFSR9721FYJ23oW7UC0qZ=HjK9xW+Oqq0w@mail.gmail.com
Lists: pgsql-general

On Tue, Jul 9, 2019 at 4:44 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> [...many partial GroupAggregate/Sort repeated here for ~200 child tables...]
>
> This isn't the first time I've had to reduce work_mem on a parallel query to
> avoid OOM, but it seems unreasonable to be concerned with 50MB sorts.

This is a fundamental problem with our model of memory budgets. To
oversimplify massively, we can use about work_mem * ntables * the
number of parallel workers. Core counts are going up, and now we have
a convenient way to get large values of ntables. One of many -hackers
threads to discuss the problem:

https://www.postgresql.org/message-id/flat/CAH2-WzmNwV%3DLfDRXPsmCqgmm91mp%3D2b4FvXNF%3DcCvMrb8YFLfQ%40mail.gmail.com
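
To make the multiplication concrete, here is a back-of-the-envelope
sketch in Python (the numbers are invented to roughly match the report
in this thread: ~200 child tables, work_mem = 50MB, a few processes;
none of them are exact):

    # Rough worst case for a parallel grouped aggregate over a partitioned
    # table, assuming every partition's sort is allowed a full work_mem in
    # every process (illustrative numbers only).
    work_mem_mb = 50       # work_mem
    n_partitions = 200     # child tables touched by the query
    n_processes = 3        # leader plus two parallel workers, say

    worst_case_mb = work_mem_mb * n_partitions * n_processes
    print(worst_case_mb)   # 30000 MB, i.e. ~30 GB of potential sort memory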

> It looks like the results of each Sort node stay in RAM, during processing of
> additional sort nodes (is that required?)

That's a very interesting question linked to whole-query memory
budgets. If you don't have a whole-query memory budget, then you have
nothing to lose by keeping hash and sort results in memory, and you
gain the ability to do cheap rescans (if the node happens to be in a
nested loop). I discussed this in the context of hash joins over here:

https://www.postgresql.org/message-id/CAEepm%3D0N6DODN7nx6Zb93YOW-y%3DRftNNFZJRaLyG6jbJHJVjsA%40mail.gmail.com
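
As a toy illustration of that trade-off (not how the executor is
actually written), here is a little Python sketch: if the sort keeps
its result resident, a rescan from a nested loop is just a rewind; if
it gives the memory back, every rescan has to redo the work.

    # Toy model of a sort node under a nested loop, to show why keeping the
    # sorted result in memory makes rescans cheap (illustrative only).
    class ToySort:
        def __init__(self, rows):
            self.input_rows = rows
            self.result = None             # sorted result kept in memory

        def scan(self):
            if self.result is None:        # first scan: do the real work
                self.result = sorted(self.input_rows)
            return iter(self.result)       # rescan: rewind, no re-sort

        def release(self):
            self.result = None             # give memory back; next scan re-sorts

    inner = ToySort([3, 1, 2])
    for outer_row in range(2):             # the nested loop rescans the inner side
        print(list(inner.scan()))          # second iteration is a cheap rewind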

The TL;DR is that some other databases throw out hash tables (for
example) as soon as possible, and consider changing the shape of their
join nesting to minimise the total number of hash tables in memory at
once, so that they come in under a certain budget for memory used at
any one point in time. That is, they don't just divide query_work_mem
up over all the nodes; they understand things about when nodes will
allocate and free memory.
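
To caricature the difference in a few lines of Python (node names,
phases and the budget are all invented; this only shows the contrast
between dividing a budget evenly and reasoning about when memory is
actually held):

    from collections import Counter

    # Each entry: node -> phase of execution during which it holds memory.
    nodes = {"sort_a": 0, "hash_b": 1, "sort_c": 1}
    query_work_mem = 300

    # Naive: divide the budget evenly over all nodes, ignoring lifetimes.
    naive_grant = {n: query_work_mem / len(nodes) for n in nodes}

    # Lifetime-aware: only nodes live in the same phase compete, so sort_a
    # can have the whole budget in phase 0 while hash_b and sort_c split
    # the budget in phase 1.
    live = Counter(nodes.values())         # how many nodes are live per phase
    aware_grant = {n: query_work_mem / live[p] for n, p in nodes.items()}

    print(naive_grant)   # {'sort_a': 100.0, 'hash_b': 100.0, 'sort_c': 100.0}
    print(aware_grant)   # {'sort_a': 300.0, 'hash_b': 150.0, 'sort_c': 150.0}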

There are some less clever things we can do, though, before we tackle
the big problems involved. We could probably still opportunistically
give back memory sooner, when we know there is no chance of a rescan,
and do other things along those lines.

> Thanks in advance for any advice.

I think it's impossible to choose a single value for work_mem if you
have a mixture of query types that hit wildly different numbers of
partitions and use different numbers of workers. I think this is an
ongoing topic for -hackers.

--
Thomas Munro
https://enterprisedb.com
