Re: Add the ability to limit the amount of memory that can be allocated to backends.

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: "Anton A(dot) Melnikov" <a(dot)melnikov(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, reid(dot)thompson(at)crunchydata(dot)com, Arne Roland <A(dot)Roland(at)index(dot)de>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, "stephen(dot)frost" <stephen(dot)frost(at)crunchydata(dot)com>
Subject: Re: Add the ability to limit the amount of memory that can be allocated to backends.
Date: 2024-01-28 19:11:56
Message-ID: 1c5f1856-817d-45e5-8e1a-acd95c6dd335@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I took a closer look at this patch over the last couple days, and I did
a bunch of benchmarks to see how expensive the accounting is. The good
news is that I haven't observed any overhead - I did two simple tests,
that I think would be particularly sensitive to this:

1) regular "pgbench -S" with scale 10, so tiny and not hitting I/O

2) "select count(*) from generate_series(1,N)" where N is either 10k or
1M, which should be enough to cause a fair number of allocations

And I tested this on three branches - master (no patches applied),
patched (but limit=0, so just accounting) and patched-limit (with limit
set to 4.5GB, so high enough not to be hit).

Attached are script and raw results for both benchmarks from two
machines, i5 (small, 4C) and xeon (bigger, 16C/32C), and a PDF showing
the results as candlestick charts (with 1 and 2 sigma intervals). AFAICS
there's no measurable difference between master and patched builds.
Which is good, clearly the local allowance makes the overhead negligible
(at least in these benchmarks).

Now, for the patch itself - I already said some of the stuff about a
month ago [1], but I'll repeat some of it for the sake of completeness
before I get to comments about the code etc.

Firstly, I agree with the goal of having a way to account for memory
used by the backends, and also ability to enforce some sort of limit.
It's difficult to track the memory at the OS level (interpreting RSS
values is not trivial), and work_mem is not sufficient to enforce a
backend-level limit, not even talking about a global limit.

But as I said earlier, it seems quite strange to start by introducing
some sort of global limit, combining memory for all backends. I do
understand that the intent is to have such global limit in order to
prevent issues with the OOM killer and/or not to interfere with other
stuff running on the same machine. And while I'm not saying we should
not have such limit, every time I wished to have a memory limit it was a
backend-level one. Ideally a workmem-like limit that would "adjust" the
work_mem values used by the optimizer (but that's not what this patch
aims to do), or at least a backstop in case something goes wrong (say, a
memory leak, OLTP application issuing complex queries, etc.).

The accounting and infrastructure introduced by the patch seems to be
suitable for both types of limits (global and per-backend), but then it
goes for the global limit first. It may seem simpler to implement, but
in practice there's a bunch of problems mentioned earlier, but ignored.

For example, I really don't believe it's OK to just report the error in
the first backend that happens to hit the limit. Bertrand mentioned that
[2], asking about a situation when a runaway process allocates 99% of
memory. The response to that was

> Initially, we believe that punishing the detector is reasonable if we
> can help administrators avoid the OOM killer/resource starvation. But
> we can and should expand on this idea.

which I find rather unsatisfactory. Belief is not an argument, and it
relies on the assumption that this helps the administrator to avoid the
OOM killer etc. ISTM it can easily mean the administrator can't even
connect to the database, run a query (to inspect the new system views),
etc. because any of that would hit the memory limit. That seems more
like a built-in DoS facility ...

If this was up to me, I'd probably start with the per-backend limit. But
that's my personal opinion.

Now, some comments about the code etc. Stephen was promising a new patch
version in October [6], but that didn't happen yet, so I looked at the
patches I rebased in December.

0001
----

1) I don't see why we should separate memory by context type - without
knowing which exact backends are using the memory, this seems pretty
irrelevant/useless. Either it's private or shared memory, I don't think
the accounting should break this into more counters. Also, while I'm not
aware of anyone proposing a new memory context type, I doubt we'd want
to add more and more counters if that happens.

suggestion: only two counters, for local and shared memory

If we absolutely want to have per-type counters, we should not mix the
private memory and shared memory ones.

2) I have my doubts about the tracking of shared memory. It seems weird
to count them only into the backend that allocated them, and transfer
them to "global" on process exit. Surely we should know which shared
memory is meant to be long-lived from the start, no?

For example, it's not uncommon that an extension allocates a chunk of
shared memory to store some sort of global state (e.g. BDR/pglogical
does that, I'm sure other extensions do that). But the process that
allocates the shared memory keeps running. AFAICS this would be tracked
as backend DSM memory, which I'm not sure this is what we want ...

And I'm not alone in thinking this should work differently - [3], [4].

3) We limit the floor of allocation counters to zero.

This seems really weird. Why would it be necessary? I mean, how could we
get a negative amount of memory? Seems like it might easily mask issues
with incorrect accounting, or something.

4) pg_stat_memory_allocation

Maybe pg_stat_memory_usage would be a better name ...

5) The comments/docs repeatedly talk about "dynamic nature" and that it
makes the values not exact:

> Due to the dynamic nature of memory allocations the allocated bytes
> values may not be exact but should be sufficient for the intended
> purposes.

I don't understand what "dynamic nature" refers to, and why would it
make the values not exact. Presumably this refers to the "allowance" but
how is that "dynamic nature"?

This definitely needs to explain this better, with some basic estimate
how how accurate the values are expected to be.

6) The SGML docs keep recommending to use pg_size_pretty(). I find that
unnecessary, no other docs reporting values in bytes do that.

7) I see no reason to include shared_memory_size_mb in the view, and
same for shared_memory_size_in_huge_pages. It has little to do with
"allocated" memory, IMO. And a value in "MB" goes directly against the
suggestion to use pg_size_pretty().

suggestion: drop this from the view

8) In a lot of places we do

context->mem_allocated += blksize;
pgstat_report_allocated_bytes_increase(blksize, PG_ALLOC_ASET);

Maybe we should integrate the two types of accounting, wrap them into a
single function call, or something? This makes it very simple to forget
updating one of those places. AFAIC the patch tries to minimize the
number of updates of the new shared counters, but with the allowance
that should not be an issue I think.

9) This seems wrong. Why would it be OK to ever overflow? Isn't that a
sign of incorrect accounting?

/* Avoid allocated_bytes unsigned integer overflow on decrease */
if (pg_sub_u64_overflow(*my_allocated_bytes, proc_allocated_bytes,
&temp))
{
/* On overflow, set allocated bytes and allocator type bytes to
zero */
*my_allocated_bytes = 0;
*my_aset_allocated_bytes = 0;
*my_dsm_allocated_bytes = 0;
*my_generation_allocated_bytes = 0;
*my_slab_allocated_bytes = 0;
}

9) How could any of these values be negative? It's all capped to 0 and
also stored in uint64. Seems pretty useless.

+SELECT
+ datid > 0, pg_size_bytes(shared_memory_size) >= 0,
shared_memory_size_in_huge_pages >= -1, global_dsm_allocated_bytes >= 0
+FROM
+ pg_stat_global_memory_allocation;
+ ?column? | ?column? | ?column? | ?column?
+----------+----------+----------+----------
+ t | t | t | t
+(1 row)
+
+-- ensure that pg_stat_memory_allocation view exists
+SELECT
+ pid > 0, allocated_bytes >= 0, aset_allocated_bytes >= 0,
dsm_allocated_bytes >= 0, generation_allocated_bytes >= 0,
slab_allocated_bytes >= 0

0003
----

1) The commit message says:

> Further requests will not be allocated until dropping below the limit.
> Keep this in mind when setting this value.

The SGML docs have a similar "keep this in mind" suggestion in relation
to the 1MB local allowance. I find this pretty useless, as it doesn't
really say what to do with the information / what it means. I mean,
should I set the value higher, or what am I supposed to do? This needs
to be understandable for average user reading the SGML docs, who is
unlikely to know the implementation details.

2) > This limit does not affect auxiliary backend processes.

This seems pretty unfortunate, because in the cases where I actually saw
OOM killer to intervene, it was often because of auxiliary processes
allocating a lot of memory (say, autovacuum with maintenance_work_mem
set very high, etc.).

In a way, not excluding these auxiliary processes from the limit seems
to go against the idea of preventing the OOM killer.

3) I don't think the patch ever explains what the "local allowance" is,
how the refill works, why it's done this way, etc. I do think I
understand that now, but I had to go through the thread, That's not
really how it should be. There should be an explanation of how this
works somewhere (comment in mcxt.c? separate README?).

4) > doc/src/sgml/monitoring.sgml

Not sure why this removes the part about DSM being included only in the
backend that created it.

5) > total_bkend_mem_bytes_available

The "bkend" name is strange (to save 2 characters), and the fact how it
combines local and shared memory seems confusing too.

6)
+ /*
+ * Local counter to manage shared memory allocations. At backend
startup, set to
+ * initial_allocation_allowance via pgstat_init_allocated_bytes().
Decrease as
+ * memory is malloc'd. When exhausted, atomically refill if available from
+ * ProcGlobal->max_total_bkend_mem via exceeds_max_total_bkend_mem().
+ */
+uint64 allocation_allowance = 0;

But this allowance is for shared memory too, and shared memory is not
allocated using malloc.

7) There's a lot of new global variables. Maybe it'd be better to group
them into a struct, or something?

8) Why does pgstat_set_allocated_bytes_storage need the new return?

9) Unnecessary changes in src/backend/utils/hash/dynahash.c (whitespace,
new comment that seems not very useful)

10) Shouldn't we have a new malloc wrapper that does the check? That way
we wouldn't need to have the call in every memory context place calling
malloc.

11) I'm not sure about the ereport() calls after hitting the limit.
Adres thinks it might lead to recursion [4], but Stephen [5] seems to
think this does not really make the situation worse. I'm not sure about
that, though - I agree the ENOMEM can already happen, but but maybe
having a limit (which clearly needs to be more stricter than the limit
used by kernel for OOM) would be more likely to hit?

12) In general, I agree with Andres [4] that we'd be better of to focus
on the accounting part, see how it works in practice, and then add some
ability to limit memory after a while.

regards

[1]
https://www.postgresql.org/message-id/4fb99fb7-8a6a-2828-dd77-e2f1d75c7dd0%40enterprisedb.com

[2]
https://www.postgresql.org/message-id/3b9b90c6-f4ae-a7df-6519-847ea9d5fe1e%40amazon.com

[3]
https://www.postgresql.org/message-id/20230110023118.qqbjbtyecofh3uvd%40awork3.anarazel.de

[4]
https://www.postgresql.org/message-id/20230113180411.rdqbrivz5ano2uat%40awork3.anarazel.de

[5]
https://www.postgresql.org/message-id/ZTArWsctGn5fEVPR%40tamriel.snowman.net

[6]
https://www.postgresql.org/message-id/ZTGycSYuFrsixv6q%40tamriel.snowman.net

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
results-count.pdf application/pdf 388.3 KB
results-pgbench.pdf application/pdf 228.4 KB
scripts.tgz application/x-compressed-tar 768 bytes
xeon-count.tgz application/x-compressed-tar 49.7 KB
xeon-pgbench.tgz application/x-compressed-tar 29.9 KB
i5-count.tgz application/x-compressed-tar 44.5 KB
i5-pgbench.tgz application/x-compressed-tar 28.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2024-01-28 19:34:40 Re: Schema variables - new implementation for Postgres 15
Previous Message Pavel Stehule 2024-01-28 19:00:57 Re: proposal: psql: show current user in prompt