Re: Patch: New GUC prepared_statement_limit to limit memory used by prepared statements

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Daniel Migowski <dmigowski(at)ikoffice(dot)de>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Subject: Re: Patch: New GUC prepared_statement_limit to limit memory used by prepared statements
Date: 2019-09-03 21:27:21
Message-ID: 6391.1567546041@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> On this patch, beyond the fact that it's causing a crash in the
> regression tests as evidenced by the CFbot, we seem to be waiting on the
> input of the larger community on whether it's a desired feature or not.
> We have Kyotaro's vote for it, but it would be good to get more.

I'm pretty dubious about this (just as I am with the somewhat-related
topic of syscache/relcache size limits). It's hard to tell where
performance is going to fall off a cliff if you limit the cache size.

Another point is that, because this only gets rid of the regeneratable
parts of a plancache entry, it isn't really going to move the needle
all that far in terms of total space consumption. As an experiment,
I instrumented GetCachedPlan right after where it fills in the gplan field
to see the relative sizes of the cache entry's contexts. Running that
over the core + PL regression tests, I get

14 GetCachedPlan: 1024 context, 1024 query_context, 1024 gplan
4 GetCachedPlan: 1024 context, 2048 query_context, 1024 gplan
1 GetCachedPlan: 1024 context, 2048 query_context, 2048 gplan
2109 GetCachedPlan: 2048 context, 2048 query_context, 2048 gplan
29 GetCachedPlan: 2048 context, 2048 query_context, 4096 gplan
6 GetCachedPlan: 2048 context, 4096 query_context, 2048 gplan
33 GetCachedPlan: 2048 context, 4096 query_context, 4096 gplan
2 GetCachedPlan: 2048 context, 4096 query_context, 8192 gplan
1 GetCachedPlan: 2048 context, 8192 query_context, 16384 gplan
4 GetCachedPlan: 2048 context, 8192 query_context, 4096 gplan
2 GetCachedPlan: 2048 context, 8192 query_context, 8192 gplan
8 GetCachedPlan: 3480 context, 8192 query_context, 8192 gplan
250 GetCachedPlan: 4096 context, 2048 query_context, 2048 gplan
107 GetCachedPlan: 4096 context, 2048 query_context, 4096 gplan
3 GetCachedPlan: 4096 context, 4096 query_context, 16384 gplan
1 GetCachedPlan: 4096 context, 4096 query_context, 2048 gplan
7 GetCachedPlan: 4096 context, 4096 query_context, 32768 gplan
190 GetCachedPlan: 4096 context, 4096 query_context, 4096 gplan
61 GetCachedPlan: 4096 context, 4096 query_context, 8192 gplan
11 GetCachedPlan: 4096 context, 8192 query_context, 4096 gplan
587 GetCachedPlan: 4096 context, 8192 query_context, 8192 gplan
1 GetCachedPlan: 4096 context, 16384 query_context, 8192 gplan
5 GetCachedPlan: 4096 context, 32768 query_context, 32768 gplan
1 GetCachedPlan: 4096 context, 65536 query_context, 65536 gplan
12 GetCachedPlan: 8192 context, 4096 query_context, 4096 gplan
2 GetCachedPlan: 8192 context, 4096 query_context, 8192 gplan
49 GetCachedPlan: 8192 context, 8192 query_context, 16384 gplan
46 GetCachedPlan: 8192 context, 8192 query_context, 8192 gplan
10 GetCachedPlan: 8192 context, 16384 query_context, 16384 gplan
1 GetCachedPlan: 8192 context, 16384 query_context, 32768 gplan
1 GetCachedPlan: 8192 context, 16384 query_context, 8192 gplan
1 GetCachedPlan: 8192 context, 32768 query_context, 32768 gplan
2 GetCachedPlan: 8192 context, 131072 query_context, 131072 gplan
3 GetCachedPlan: 16384 context, 8192 query_context, 16384 gplan
1 GetCachedPlan: 16384 context, 16384 query_context, 16384 gplan
2 GetCachedPlan: 16384 context, 16384 query_context, 17408 gplan
1 GetCachedPlan: 16384 context, 16384 query_context, 32768 gplan
1 GetCachedPlan: 16384 context, 16384 query_context, 65536 gplan

(The first column is the number of occurrences of the log entry;
I got this list from "grep|sort|uniq -c" on the postmaster log.)
Patch for this is attached, just in the interests of full disclosure.
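For anyone who doesn't want to fetch the attachment, the instrumentation
is of this general shape (a paraphrased sketch, not the attachment
verbatim; the helper name is mine, and the stats-method call matches
what's in HEAD's memnodes.h, so adjust to suit your tree):

static Size
context_space(MemoryContext context)
{
    MemoryContextCounters totals;

    memset(&totals, 0, sizeof(totals));
    context->methods->stats(context, NULL, NULL, &totals);
    return totals.totalspace;
}

/* in GetCachedPlan, just after plansource->gplan is filled in: */
elog(LOG, "GetCachedPlan: %zu context, %zu query_context, %zu gplan",
     context_space(plansource->context),
     context_space(plansource->query_context),
     plansource->gplan ? context_space(plansource->gplan->context) : 0);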

So yeah, there are cases where you can save a whole lot by deleting
the query_context and/or gplan, but they're pretty few and far between;
more commonly, the nonreclaimable data in "context" accounts for a third
of the usage.

(BTW, it looks to me like the code tries to keep the *total* usage under
the GUC limit, not the freeable usage. Which means it won't be hard at all
to drive it to the worst case where it tries to free everything all the
time, if the nonreclaimable data is already over the limit.)
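To make that concrete (hypothetical names, not the patch's code):

static bool
over_limit(Size nonreclaimable, Size freeable, Size limit)
{
    /*
     * What the patch effectively checks.  Once "nonreclaimable" alone
     * exceeds "limit", this is true on every call: we free and rebuild
     * all the reclaimable data continually, yet never get under the
     * limit.
     */
    /* return nonreclaimable + freeable > limit; */

    /* Only the freeable part can actually be given back: */
    return freeable > limit;
}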

Admittedly, the regression tests might well not be representative of
common usage, but if you don't want to take them as a benchmark then
we need some other benchmark we can look at.

I also notice that this doesn't seem to be doing anything with
CachedExpressions, which are likely to be a pretty big factor in
the usage of e.g. plpgsql.

Now, I'd be the first to say that my thoughts about this are probably
biased by my time at Salesforce, where their cache-consumption problems
were driven by lots and lots and lots and lots (and lots) of plpgsql code.
Maybe with another usage pattern you'd come to a different conclusion,
but if I were trying to deal with that situation, what I'd look at
doing is reclaiming stuff starting at the plpgsql function cache level,
and then cascading down to the plancache entries referenced by a plpgsql
function body you've chosen to free. One major advantage of doing that
is that plpgsql has a pretty clear idea of when a given function cache
instance has gone to zero refcount, whereas plancache.c simply doesn't
know that.
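For illustration only, the shape I'm thinking of is roughly like this,
written as if it lived in pl_comp.c (plpgsql_HashTable, plpgsql_HashEnt,
use_count, and delete_function all exist there today; the wrapper and
any actual eviction policy are invented):

static void
plpgsql_evict_unused_functions(void)
{
    HASH_SEQ_STATUS seq;
    plpgsql_HashEnt *hentry;

    hash_seq_init(&seq, plpgsql_HashTable);
    while ((hentry = (plpgsql_HashEnt *) hash_seq_search(&seq)) != NULL)
    {
        PLpgSQL_function *func = hentry->function;

        /*
         * plpgsql knows when a function cache entry has gone to zero
         * refcount; plancache.c doesn't.  Freeing the function body
         * releases its SPI plans, and those in turn drop the
         * referenced plancache entries.
         */
        if (func != NULL && func->use_count == 0)
            delete_function(func);
    }
}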

As far as the crash issue is concerned, I notice that right now the
cfbot is showing green for this patch, but that seems to just be because
the behavior is unstable. I see crashes in "make installcheck-parallel"
about 50% of the time with this patch applied. Since, in fact,
the patch is not supposed to be doing anything at all with
prepared_statement_limit set to zero, that suggests sloppiness in the
refactoring that was done to separate out the resource-freeing code.

On the other hand, if I do ALTER SYSTEM SET prepared_statement_limit = 1
and then run "make installcheck-parallel", I see a different set of
failures. It rather looks like the patch is deleting plans out from
under plpgsql, which connects back to my point about plancache.c not
really knowing whether a plan is in use or not. Certainly,
EnforcePreparedStatementLimit as coded here has no idea about that.

(Speaking of ALTER SYSTEM, why the devil is the variable PGC_POSTMASTER?
That seems entirely silly.)
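Nothing here needs a restart; a guc.c entry along these lines would do
(boot/min/max values illustrative, description mine):

{
    {"prepared_statement_limit", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Sets the maximum memory used by a session's prepared statements."),
        NULL,
        GUC_UNIT_KB
    },
    &prepared_statement_limit,
    0, 0, MAX_KILOBYTES,
    NULL, NULL, NULL
},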

Aside from Alvaro's style points, I'm fairly concerned about the overhead
the patch will add when active. Running through the entire plancache
and collecting memory stats is likely to be quite an expensive
proposition when there are lots of entries, yet this is willing to do that
over again at the drop of a hat. It won't be hard to get this to expend
O(N^2) time with N entries. I think that for acceptable performance,
it'll be necessary to track total usage incrementally instead of doing
it this way.
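That is, roughly (all names hypothetical):

static Size total_cached_plan_space = 0;

/* call wherever an entry's contexts grow, e.g. after BuildCachedPlan
 * fills in a generic plan: */
static void
plancache_space_added(Size bytes)
{
    total_cached_plan_space += bytes;
}

/* call from ReleaseGenericPlan/DropCachedPlan when memory is freed: */
static void
plancache_space_removed(Size bytes)
{
    Assert(total_cached_plan_space >= bytes);
    total_cached_plan_space -= bytes;
}

/* then the limit check is O(1), with no walk over the cached_plans
 * list at all: */
static bool
over_prepared_statement_limit(Size limit_bytes)
{
    return total_cached_plan_space > limit_bytes;
}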

regards, tom lane

Attachment: instrumentation-patch.txt (text/plain, 2.4 KB)
