Re: PostgreSQL 14.4 ERROR: out of memory issues

From: Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: PostgreSQL 14.4 ERROR: out of memory issues
Date: 2022-07-14 19:25:54
Message-ID: CAODqTUbZG+Fp-NfynbATfGf0y=xsi1XfwBCDQWgXWeAcVOFXZQ@mail.gmail.com
Lists: pgsql-general

On Wed 13 Jul 2022 at 2:20, Michael Paquier <michael(at)paquier(dot)xyz>
wrote:

> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
> > So far, it has happened three times (during a single week) from the 14.3
> ->
> > 14.4 upgrade, before 14.4 we haven't suffered from such an issue.
> >
> > Questions:
> > 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the
> instance
> > and reinstalling 14.3 PG packages (to prove, that the issue disappear)?
> > 2) What is the best way to diagnose what is the root cause?
>
> Hmm. 14.4 has nothing in its release notes that would point to a
> change in the vacuum or autovacuum's code paths:
> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>
> There is nothing specific after a look at the changes as of, and I am
> not grabbing anything that would imply a change in memory context
> handling either:
> `git log --stat REL_14_3..REL_14_4`
> `git diff REL_14_3..REL_14_4 -- *.c`
>
> Saying that, you should be able to downgrade safely as there are no
> changes in WAL format or such that would break things. Saying that,
> the corruption issue caused by CONCURRENTLY is something you'd still
> have to face.
>
>
Thanks, good to know; we can use that for a test case, since we already
hit the CONCURRENTLY bug on 14.3.

> 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR: out of
> > memory
> > 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL:
> Failed
> > on request of size 152094068 in memory context "TopTransactionContext".
> > 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:
> > automatic vacuum of table "prematch.replication.tab_queue_tmp"
>
> This is the interesting part. Do you happen to use logical
> replication in a custom C++ plugin?
>

We are using logical replication to other instances (pgoutput) and
decoderbufs (https://github.com/debezium/postgres-decoderbufs) for other
applications.
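
For reference, the slots and their output plugins on the instance can be
listed with something like:

    psql -d prematch -c "SELECT slot_name, plugin, slot_type, active, active_pid FROM pg_replication_slots;"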

> > 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG: could
> > not fork worker process: Cannot allocate memory
> > terminate called after throwing an instance of 'std::bad_alloc'
> > what(): std::bad_alloc
> >
> > DETAIL: parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
> > terminate
> > called after throwing an instance of 'std::bad_alloc' terminate called
> > after throwing an instance of 'std::bad_alloc' what(): what():
> > std::bad_allocstd::bad_alloc 2022-07-08 14:54:23 CEST [4476]: [49-1]
> > user=,db=,host=,app= LOG: background worker "parallel worker" (PID 25251)
> > was terminated by signal 6: Aborted
> > 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG:
> > terminating any other active server processes
>
> Looks like something is going wrong in the memory handling of one of
> your C++ extensions here. If you can isolate an issue using a query
> without any custom code, that would be a Postgres problem, but I think
> that you are missing a trick in it.
>

Here are extensions installed in the database served by the cluster:
prematch=# \dx
                                       List of installed extensions
        Name        | Version |   Schema   |                              Description
--------------------+---------+------------+-------------------------------------------------------------------------
 amcheck            | 1.3     | public     | functions for verifying relation integrity
 dblink             | 1.2     | public     | connect to other PostgreSQL databases from within a database
 file_fdw           | 1.0     | public     | foreign-data wrapper for flat file access
 hstore             | 1.8     | public     | data type for storing sets of (key, value) pairs
 hypopg             | 1.3.1   | public     | Hypothetical indexes for PostgreSQL
 pageinspect        | 1.9     | public     | inspect the contents of database pages at a low level
 pg_buffercache     | 1.3     | public     | examine the shared buffer cache
 pg_stat_kcache     | 2.2.0   | public     | Kernel statistics gathering
 pg_stat_statements | 1.9     | public     | track planning and execution statistics of all SQL statements executed
 pgcrypto           | 1.3     | public     | cryptographic functions
 pgstattuple        | 1.5     | public     | show tuple-level statistics
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
 plpgsql_check      | 2.1     | public     | extended check for plpgsql functions
 postgres_fdw       | 1.1     | public     | foreign-data wrapper for remote PostgreSQL servers
 tablefunc          | 1.0     | public     | functions that manipulate whole tables, including crosstab
(15 rows)

When we started experiencing these issues, I began, based on the Prometheus
node-exporter statistics, collecting /proc/<pid>/status for all PG instance
processes in an infinite loop.
From that data it looks like RssAnon is the memory that grows, from several
hundred kB to 1-2 GB over some period (depending on workload, but usually
about 24 hours).
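
The collection loop is nothing sophisticated, roughly along these lines
(the interval, field list and output path below are illustrative, not the
exact script):

    # sample RssAnon and related fields for all postgres-owned processes once a minute
    while true; do
        ts=$(date +%Y-%m-%dT%H:%M:%S)
        for pid in $(pgrep -u postgres); do
            [ -r "/proc/$pid/status" ] && \
                awk -v ts="$ts" -v pid="$pid" \
                    '/^(Name|VmRSS|RssAnon|RssFile|RssShmem):/ {print ts, pid, $0}' \
                    "/proc/$pid/status"
        done >> /var/tmp/pg_proc_status.log
        sleep 60
    done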
The sessions whose RssAnon grows belong to PHP client applications; with
multiple connections they can consume about 10 GB of memory in total. It is
the same application as before the upgrade (I've asked about deployed
changes, and none was released close to the upgrade time).

Comparing with the Prometheus server data, RssAnon seems to directly follow
the node_memory_Active_anon_bytes node-exporter metric: when the "suspected"
sessions disconnect, node_memory_Active_anon_bytes drops down.
I've also checked node_memory_Active_anon_bytes before the upgrade (under
the same application workload): it was pretty stable on PG 13.8 and later on
14.3, but immediately after the upgrade node_memory_Active_anon_bytes
started to grow, and once the CommitLimit was reached, the fork()/out of
memory issues appeared.

So for the application user running the PHP code, I've set
log_min_duration_statement = 0, but no unexpected statements or arguments
(some functions take an int array argument) were found in the logfile.
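
For the record, the setting was applied per role, roughly like this
(app_php_user is a placeholder for the real application role):

    psql -d prematch -c "ALTER ROLE app_php_user SET log_min_duration_statement = 0;"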

Checking RssAnon from /proc/<pid>/status, I've found some points where
RssAnon usage grew very steeply within a minute, but no "suspicious"
queries/arguments were found in the instance logfile around those times.

Any hint on how to find the root cause would be appreciated, since so far
I've failed to isolate the issue into a reproducible test case.
At least I hope that watching per-process RssAnon is an appropriate metric;
if not, let me know and I'll try to adjust the monitoring to get at the
root cause.
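
One more thing I'm considering, assuming the built-in
pg_log_backend_memory_contexts() in 14 is usable for this: dumping the
memory contexts of a suspected backend into the server log while its
RssAnon is high, e.g.:

    # 12345 stands for the PID of a backend whose RssAnon keeps growing
    psql -d prematch -c "SELECT pg_log_backend_memory_contexts(12345);"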

I can imagine a workaround of having the client application reconnect
regularly, but up to 14.3 this worked fine, so I'd like to fix the issue
either on our application side or on the PG side if it is a PG problem.

Kind regards Ales Zeleny

--
> Michael
>
