Re: BUG #17974: Walsenders memory usage suddenly spike to 80G+ causing OOM and server reboot

From: Michael Guissine <mguissine(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17974: Walsenders memory usage suddenly spike to 80G+ causing OOM and server reboot
Date: 2023-06-19 19:36:22
Message-ID: CACxDrA=e_YsY4vh4MvZHG-BqLXM2k1-cZb3oeTjm1HJdm7vpxw@mail.gmail.com
Lists: pgsql-bugs

I stand corrected, @Andres Freund <andres(at)anarazel(dot)de>: the memory
DID recover immediately after the reboot, but it started dropping again
shortly after. It fully recovered only after we dropped and recreated all
replication slots.

On Wed, Jun 14, 2023 at 6:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2023-06-14 10:23:32 +0900, Michael Paquier wrote:
> > On Wed, Jun 14, 2023 at 12:05:32AM +0000, PG Bug reporting form wrote:
> > > We are running relatively large and busy Postgres database on RDS and using
> > > logical replication extensively. We currently have 7 walsenders and while we
> > > often see replication falls behind due to high transactional volume, we've
> > > never experienced memory issues in 14.6 and below. After recent upgrade to
> > > 14.8, we already had several incidents where walsender processes RES memory
> > > would suddenly increase to over 80GB each causing freeable memory on the
> > > instance to go down to zero.
>
> When postgres knows it ran out of memory (instead of having gotten killed by
> the OOM killer), it'll dump memory context information to the log. Could you
> check whether there are related log entries? They should precede an "out of
> memory" ERROR.
>
>
> > > Interesting that even after Instance reboot,
> > > the memory used by walsender processes won't get released until we restart
> > > the replication and drop the logical slots. The logical_decoding_work_mem
> > > was set to 512MB in time of the last incident but we recently lowered it to
> > > 128MB.
>
> That seems very unlikely to be the case. If you restarted postgres or postgres
> and the OS, there's nothing to have allocated the memory. What exactly do you
> mean by "Instance reboot"?
>
>
> > > Any known issues in pg 14.8 that would trigger this behaviour?
> >
> > Yes, there are known issues with memory handling in logical
> > replication setups. See for example this thread:
> >
> > https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih+qNsWQPciEwos6g_AMbnz_peNoxfHwyw@mail.gmail.com
>
> Why would 14.8 have made that problem worse?
>
> Greetings,
>
> Andres Freund
>
