Re: PostgreSQL 14.4 ERROR: out of memory issues

From: Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: PostgreSQL 14.4 ERROR: out of memory issues
Date: 2022-07-14 19:25:54
Message-ID: CAODqTUbZG+Fp-NfynbATfGf0y=xsi1XfwBCDQWgXWeAcVOFXZQ@mail.gmail.com
Lists: pgsql-general

On Wed 13 Jul 2022 at 2:20, Michael Paquier <michael(at)paquier(dot)xyz>
wrote:

> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
> > So far, it has happened three times (during a single week) from the 14.3
> ->
> > 14.4 upgrade, before 14.4 we haven't suffered from such an issue.
> >
> > Questions:
> > 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the
> instance
> > and reinstalling 14.3 PG packages (to prove, that the issue disappear)?
> > 2) What is the best way to diagnose what is the root cause?
>
> Hmm. 14.4 has nothing in its release notes that would point to a
> change in the vacuum or autovacuum's code paths:
> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>
> There is nothing specific after a look at the changes as of, and I am
> not grabbing anything that would imply a change in memory context
> handling either:
> `git log --stat REL_14_3..REL_14_4`
> `git diff REL_14_3..REL_14_4 -- *.c`
>
> Saying that, you should be able to downgrade safely as there are no
> changes in WAL format or such that would break things. Saying that,
> the corruption issue caused by CONCURRENTLY is something you'd still
> have to face.
>
>
Thanks, good to know; we can use that for a test case, since we already
hit the CONCURRENTLY bug on 14.3.

> 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR: out of
> > memory
> > 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL:
> Failed
> > on request of size 152094068 in memory context "TopTransactionContext".
> > 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:
> > automatic vacuum of table "prematch.replication.tab_queue_tmp"
>
> This is the interesting part. Do you happen to use logical
> replication in a custom C++ plugin?
>

We are using logical replication to other instances (pgoutput) and
decoderbufs (https://github.com/debezium/postgres-decoderbufs) for other
applications.
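
For reference, the slots and their output plugins on the instance can be
listed with something like:

    psql -d prematch -c "SELECT slot_name, plugin, slot_type, active, active_pid FROM pg_replication_slots;"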

> > 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG: could
> > not fork worker process: Cannot allocate memory
> > terminate called after throwing an instance of 'std::bad_alloc'
> > what(): std::bad_alloc
> >
> > DETAIL: parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
> > terminate
> > called after throwing an instance of 'std::bad_alloc' terminate called
> > after throwing an instance of 'std::bad_alloc' what(): what():
> > std::bad_allocstd::bad_alloc 2022-07-08 14:54:23 CEST [4476]: [49-1]
> > user=,db=,host=,app= LOG: background worker "parallel worker" (PID 25251)
> > was terminated by signal 6: Aborted
> > 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG:
> > terminating any other active server processes
>
> Looks like something is going wrong in the memory handling of one of
> your C++ extensions here. If you can isolate an issue using a query
> without any custom code, that would be a Postgres problem, but I think
> that you are missing a trick in it.
>

Here are extensions installed in the database served by the cluster:
prematch=# \dx
                                       List of installed extensions
        Name        | Version |   Schema   |                              Description
--------------------+---------+------------+-------------------------------------------------------------------------
 amcheck            | 1.3     | public     | functions for verifying relation integrity
 dblink             | 1.2     | public     | connect to other PostgreSQL databases from within a database
 file_fdw           | 1.0     | public     | foreign-data wrapper for flat file access
 hstore             | 1.8     | public     | data type for storing sets of (key, value) pairs
 hypopg             | 1.3.1   | public     | Hypothetical indexes for PostgreSQL
 pageinspect        | 1.9     | public     | inspect the contents of database pages at a low level
 pg_buffercache     | 1.3     | public     | examine the shared buffer cache
 pg_stat_kcache     | 2.2.0   | public     | Kernel statistics gathering
 pg_stat_statements | 1.9     | public     | track planning and execution statistics of all SQL statements executed
 pgcrypto           | 1.3     | public     | cryptographic functions
 pgstattuple        | 1.5     | public     | show tuple-level statistics
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
 plpgsql_check      | 2.1     | public     | extended check for plpgsql functions
 postgres_fdw       | 1.1     | public     | foreign-data wrapper for remote PostgreSQL servers
 tablefunc          | 1.0     | public     | functions that manipulate whole tables, including crosstab
(15 rows)

When we started experiencing these issues, I began, based on the Prometheus
node-exporter statistics, collecting /proc/<pid>/status for all PG instance
processes in an infinite loop.
From that data it looks like RssAnon is the memory that grows, from several
hundred kB to 1-2 GB over some period (depending on workload, but usually
about 24 hours).
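
The collection loop is nothing sophisticated, roughly along these lines
(the interval, field list and output path below are illustrative, not the
exact script):

    # sample RssAnon and related fields for all postgres-owned processes once a minute
    while true; do
        ts=$(date +%Y-%m-%dT%H:%M:%S)
        for pid in $(pgrep -u postgres); do
            [ -r "/proc/$pid/status" ] && \
                awk -v ts="$ts" -v pid="$pid" \
                    '/^(Name|VmRSS|RssAnon|RssFile|RssShmem):/ {print ts, pid, $0}' \
                    "/proc/$pid/status"
        done >> /var/tmp/pg_proc_status.log
        sleep 60
    done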
The sessions whose RssAnon grows belong to PHP client applications; with
multiple connections they can consume about 10 GB of memory in total. It is
the same application as before the upgrade (I've asked about deployed
changes, and none was released close to the upgrade time).

Comparing with the Prometheus server data, RssAnon seems to directly follow
the node_memory_Active_anon_bytes node-exporter metric: when the "suspected"
sessions disconnect, node_memory_Active_anon_bytes drops down.
I've also checked node_memory_Active_anon_bytes before the upgrade (under
the same application workload): it was pretty stable on PG 13.8 and later on
14.3, but immediately after the upgrade node_memory_Active_anon_bytes
started to grow, and once the CommitLimit was reached, the fork()/out of
memory issues appeared.

So for the application user running the PHP code, I've set
log_min_duration_statement = 0, but no unexpected statements or arguments
(some functions take an int array argument) were found in the logfile.
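
For the record, the setting was applied per role, roughly like this
(app_php_user is a placeholder for the real application role):

    psql -d prematch -c "ALTER ROLE app_php_user SET log_min_duration_statement = 0;"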

Checking RssAnon from /proc/<pid>/status, I've found some points where
RssAnon usage grew very steeply within a minute, but no "suspicious"
queries/arguments were found in the instance logfile around those times.

Any hint on how to find the root cause would be appreciated, since so far
I've failed to isolate the issue into a reproducible test case.
At least I hope that watching per-process RssAnon is an appropriate metric;
if not, let me know and I'll try to adjust the monitoring to get at the
root cause.
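
One more thing I'm considering, assuming the built-in
pg_log_backend_memory_contexts() in 14 is usable for this: dumping the
memory contexts of a suspected backend into the server log while its
RssAnon is high, e.g.:

    # 12345 stands for the PID of a backend whose RssAnon keeps growing
    psql -d prematch -c "SELECT pg_log_backend_memory_contexts(12345);"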

I can imagine a workaround of having the client application reconnect
regularly, but up to 14.3 this worked fine, so I'd like to fix the issue
either on our application side or on the PG side if it is a PG problem.

Kind regards Ales Zeleny

--
> Michael
>
