Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Treating work_mem as a shared resource (Was: Parallel Hash take II)
Date: 2017-11-16 03:38:17
Message-ID: CAH2-WzmNwV=LfDRXPsmCqgmm91mp=2b4FvXNF=cCvMrb8YFLfQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 15, 2017 at 1:06 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> In the old days, Oracle had only simple per-operation memory limits
> too, and that applied to every operation in every thread just like our
> work_mem. It's interesting that they had separate knobs for sort and
> hash though, and defaulted to giving hash twice as much.

It makes plenty of sense to give hash twice as much IMV.

I'm interested in the *economics* of how we use memory -- I think we
could do a lot better there. Memory capacities have certainly
increased dramatically over the years since sort_mem became work_mem,
but I suspect that users are (quite reasonably) still giving most of
that over to shared_buffers/FS cache. And, storage/network bandwidth
also increased dramatically during that time, so single-pass external
sorts will continue to be a sensible thing to do frequently. Hashing
is a different story, though -- you really do want to make sure that
hash-based operations have access to more memory, where it can really
be put to use.

Though I am primarily concerned about the fact that we don't give any
weight to how sensitive hash-based operations are to having less
memory, I don't really want to talk about band-aid measures like a
hash_mem GUC (though that does have a certain appeal). I want to talk
about moving past work_mem, and towards a model where work_mem-like
memory is treated like a shared resource, under a regime that
intelligently weighs the effects of making more memory available to
one plan based on system-wide accounting, and the sensitivity of each
memory-consuming operation to the availability of memory. This thread
is intended to get the ball rolling on that.

It seems like we need something like this to get the full benefit of
our numerous enhancements to sorting and hashing.

> With a whole-plan memory target, our planner would probably begin to
> plan join order differently to minimise the number of hash tables in
> memory at once, like other RDBMSs. Not sure how the plan-wide target
> should work though -- try many plans, giving different portions of
> budget to different subplans? That should work fine if you like
> O(please-melt-my-computer), especially if combined with a similar
> approach to choosing worker numbers. Some kind of feedback system?
> Seems like a different kind of planner, but I have no clue. If you
> have ideas/papers/references, it'd be great to see a new thread on
> that subject.

My first thought is that we might implement a model where little
changes about work_mem itself -- it becomes a minimum for each
work_mem-consuming operation. There could be an additional "emergency
memory fund" that individual plan nodes can avail of during execution,
if and when it looks like they'll fall short of an allocation that
allows the work_mem-consuming operation to perform "optimally" or
"reasonably". Requests would happen at certain "penny-wise,
pound-foolish" points. There'd be big differences in how we do this for
sorts as compared to hash joins, because the underlying cost functions
look totally different for each. There'd probably be a maximum amount of memory
that each executor node could request from the emergency fund, such as
a smallish multiple of work_mem (work_mem being the minimum budget it
can *reliably* claim). A single hash join asking for the entire
emergency fund for itself all at once seems excessive, and likely to
create problems in other areas, so that should be impossible. And, it
should be "hard" for a node to ask for and/or receive the absolute
maximum it can get, because we want to keep that headroom for cases
that would otherwise truly be much slower.
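
To make the shape of that a bit more concrete, here is a very rough
sketch in pseudo-C of the kind of interface I'm imagining. To be clear,
everything here is hypothetical -- the names, the shared counter, the
cap of 4x work_mem -- it's only meant to illustrate "reliable minimum
plus bounded, revocable extra":

/* Assumes postgres.h, miscadmin.h (for work_mem), storage/spin.h */

/* Hypothetical shared state (e.g. in shmem) */
typedef struct EmergencyMemFund
{
    slock_t     mutex;
    int64       total_kb;       /* configured size of the shared fund */
    int64       reserved_kb;    /* currently handed out to executor nodes */
} EmergencyMemFund;

/*
 * Try to reserve up to request_kb of extra memory for one executor node,
 * on top of its guaranteed work_mem.  Returns the amount actually granted,
 * which may be less than requested (possibly zero).  The grant is capped
 * at a smallish multiple of work_mem, so that no single node can drain
 * the whole fund at once.
 */
static int64
emergency_mem_reserve(EmergencyMemFund *fund, int64 request_kb)
{
    int64       cap_kb = (int64) work_mem * 4;  /* illustrative cap only */
    int64       granted;

    if (request_kb > cap_kb)
        request_kb = cap_kb;

    SpinLockAcquire(&fund->mutex);
    granted = Max(Min(request_kb, fund->total_kb - fund->reserved_kb), 0);
    fund->reserved_kb += granted;
    SpinLockRelease(&fund->mutex);

    return granted;
}

/* Give the grant back at node shutdown (or once we've spilled anyway) */
static void
emergency_mem_release(EmergencyMemFund *fund, int64 granted_kb)
{
    SpinLockAcquire(&fund->mutex);
    fund->reserved_kb -= granted_kb;
    SpinLockRelease(&fund->mutex);
}

Whether granting happens in the planner, at executor startup, or lazily
at the point where a node is about to spill (as assumed here) is of
course one of the big open questions.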

All in all, this approach shouldn't be too radical a departure from
the work_mem model. I admit that there are significant risks with this
approach as a project. It seems like there is a chance that it won't
be ambitious enough in the end, because there are so many competing
considerations. At the same time, I cannot help but be concerned about
how naive we are about how sorting and hashing respond to work_mem.

Anyway, here is how I see the extra/emergency requests working for
specific operations:

* For sorts, a non-optimal sort -- one that goes on to ask for extra
memory -- would make its request at the point where it first looks like
multiple merge passes will be required. As I pointed out in my Sort vs.
Hash talk, that's actually pretty rare these days, because as work_mem
doubles, the capacity to do everything in one pass quadruples (a single
merge pass can combine a number of runs proportional to work_mem, and
each run is itself about work_mem in size). You should really never get
multiple passes for an external sort -- the economics of doing it that
way with any frequency are not likely to be sensible on modern
machines.

* For hash join, the "emergency request" logic could be much more
sophisticated, and would be much more likely to be used in practice.
Unlike with sorting, I think that we'd probably want to worry about the
difference between performing a hash join entirely in-memory versus
having to do any amount of batching at all. Requests would generally be
more "eager" than the requests that tuplesort makes, because smaller
differences in memory matter much more, much sooner. (There's a rough
sketch of what such a decision point might look like after this list.)

* For hash aggregates, we might have "overage requests" from the
emergency fund, or something along those lines. The fund might
therefore go into deficit (turn negative) when hash aggregates
misbehave, since it cannot "say no" to these requests (hash aggregate
will not currently take no for an answer, because it has no spill
mechanism). Central accounting could at least limit the overall impact
when that happens, and might also provide a useful choke point for new
alerting and monitoring systems that hook into the "central memory
management" logic. Hash aggregates might go over work_mem without it
really mattering much of the time.
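
To be a little more concrete about the hash join case: the natural
point to consult such a fund is right before the executor would
otherwise react to hitting the hash table's space allowance by doubling
nbatch (what ExecHashIncreaseNumBatches() reacts to today). Another
hypothetical sketch, reusing the made-up helpers from above -- a real
version would presumably want to request headroom based on the
planner's estimate of the inner relation size, not just the immediate
shortfall:

/* Assumes executor/hashjoin.h for HashJoinTable and its space fields */

static bool
hash_try_avoid_batching(HashJoinTable hashtable, EmergencyMemFund *fund)
{
    int64       shortfall_kb;
    int64       granted_kb;

    /* How much more memory would let us stay within the allowance? */
    shortfall_kb = ((int64) hashtable->spaceUsed -
                    (int64) hashtable->spaceAllowed) / 1024 + 1;

    granted_kb = emergency_mem_reserve(fund, shortfall_kb);
    if (granted_kb >= shortfall_kb)
    {
        /* Enough to avoid (further) batching: raise the allowance */
        hashtable->spaceAllowed += (Size) granted_kb * 1024;
        return true;
    }

    /* Not enough to change the picture, so don't sit on a useless grant */
    emergency_mem_release(fund, granted_kb);
    return false;
}

The equivalent sort-side logic would presumably live near the point
where tuplesort realizes that it cannot merge everything in a single
pass.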

ISTM that maintenance_work_mem tasks face a dilemma similar to this
"sort versus hash" one: the "CREATE INDEX versus VACUUM" dilemma. We
should try to address that as part of this effort. (That dilemma is one
reason why I wrote the autovacuum_work_mem patch -- which is not a
million miles from the idea of a hash_mem GUC.)

To come up with a real proposal for treating local memory as a shared
resource, I think that we need:

* To hear more ideas about keeping things in balance here. How clever
should we be?

* To experimentally verify that the cost functions (latency as a
function of memory) for things like sorting, merge join, hash join,
and hash aggregate are what we think they are.

* To understand how this relates to admission control. The only
obvious difference that I can think of is that admission control
probably involves queuing when memory is very constrained, and isn't
limited to caring about memory. I'm not trying to make swapping/OOM
impossible here; I'm trying to make it easier to be a Postgres DBA
sizing work_mem, and to make it so that DBAs don't have to be stingy
with work_mem. The work_mem sizing formulas we sometimes promote (based
on max_connections) are probably very limiting in the real world (see
the illustration after this list).

* To better understand the role of the optimizer here. Can we get more
hash aggregate plans in the first place without risking swapping/OOM?
Is it okay that what I propose kind of works against that goal?

* To better understand the role of parallel query here.

* To try to find a way to assess what "better" here looks like,
overall. What benchmark might we devise to assess what a good, robust
strategy looks like?
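
On that point about sizing formulas, to illustrate what I mean (this is
just the kind of rule of thumb that gets passed around, not anything
from the docs, and the numbers are invented):

    work_mem ~= (RAM - shared_buffers) / (max_connections * nodes-per-query)

    e.g. 64GB of RAM, 16GB of shared_buffers, max_connections = 400,
    and an allowance of ~3 memory-hungry nodes per query:

    (64GB - 16GB) / (400 * 3) = ~40MB

Even on a 64GB box the "safe" setting ends up being a few tens of
megabytes, and it only gets worse as max_connections grows -- despite
the fact that most of those sessions will never come close to using
their theoretical worst case amount of memory at the same time.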

I freely admit that my proposal is pretty hand-wavy at this point,
but, as I said, I want to at least get the ball rolling.
--
Peter Geoghegan
