Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5

From: Duncan Sands <duncan(dot)sands(at)deepbluecap(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Subject: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5
Date: 2025-05-21 11:30:58
Message-ID: f0b728d5-0061-46d2-a52c-7babd8b6024f@deepbluecap.com
Lists: pgsql-bugs

Hi Amit and Shlok, thanks for thinking about this issue. We are working on
reproducing it in our test environment. Since it seems likely to be related to
our primary database being very busy with lots of concurrency and large
transactions, we are starting by creating a streaming replication copy of our
primary server (this copy to run 17.5, with the primary on 17.4), with the idea
of then doing logical replication from the standby to see if we hit the same
issue. If so, that gives us something to poke at, and we can work towards a
simpler reproduction from there.
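
Roughly, the plan looks like this (host names, the replication user and the
publication name below are placeholders, not our real configuration):

# 1. Build a physical standby of the 17.4 primary using 17.5 binaries;
#    -R writes primary_conninfo and creates standby.signal for us.
pg_basebackup -h primary17 -U replicator -D /var/lib/postgresql/17/standby \
    -R -X stream --checkpoint=fast

# 2. On the standby, enable hot_standby_feedback so logical slots on the
#    standby are not invalidated, then start it.

# 3. Point a logical subscriber at the standby rather than at the primary;
#    the publication itself is defined on the primary and reaches the
#    standby through physical replication.
psql -h subscriber-host -d blue -c "
  CREATE SUBSCRIPTION repro_sub
    CONNECTION 'host=standby17 dbname=blue user=remote_production_user'
    PUBLICATION repro_pub;"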

Best wishes, Duncan.

On 21/05/2025 07:48, Amit Kapila wrote:
> On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> <duncan(dot)sands(at)deepbluecap(dot)com> wrote:
>>
>> PostgreSQL v17.5 (Ubuntu 17.5-1.pgdg24.04+1); Ubuntu 24.04.2 LTS (kernel
>> 6.8.0); x86-64
>>
>> Good morning from DeepBlueCapital. Soon after upgrading to 17.5 from 17.4, we
>> started seeing logical replication failures with publisher errors like this:
>>
>> ERROR: invalid memory alloc request size 1196493216
>>
>> (the exact size varies). Here is a typical log extract from the publisher:
>>
>> 2025-05-19 10:30:14 CEST [1348336-465] remote_production_user(at)blue DEBUG:
>> 00000: write FB03/349DEF90 flush FB03/349DEF90 apply FB03/349DEF90 reply_time
>> 2025-05-19 10:30:07.467048+02
>> 2025-05-19 10:30:14 CEST [1348336-466] remote_production_user(at)blue LOCATION:
>> ProcessStandbyReplyMessage, walsender.c:2431
>> 2025-05-19 10:30:14 CEST [1348336-467] remote_production_user(at)blue DEBUG:
>> 00000: skipped replication of an empty transaction with XID: 207637565
>> 2025-05-19 10:30:14 CEST [1348336-468] remote_production_user(at)blue CONTEXT:
>> slot "jnb_production", output plugin "pgoutput", in the commit callback,
>> associated LSN FB03/349FF938
>> 2025-05-19 10:30:14 CEST [1348336-469] remote_production_user(at)blue LOCATION:
>> pgoutput_commit_txn, pgoutput.c:629
>> 2025-05-19 10:30:14 CEST [1348336-470] remote_production_user(at)blue DEBUG:
>> 00000: UpdateDecodingStats: updating stats 0x5ae1616c17a8 0 0 0 0 1 0 1 191
>> 2025-05-19 10:30:14 CEST [1348336-471] remote_production_user(at)blue LOCATION:
>> UpdateDecodingStats, logical.c:1943
>> 2025-05-19 10:30:14 CEST [1348336-472] remote_production_user(at)blue DEBUG:
>> 00000: found top level transaction 207637519, with catalog changes
>> 2025-05-19 10:30:14 CEST [1348336-473] remote_production_user(at)blue LOCATION:
>> SnapBuildCommitTxn, snapbuild.c:1150
>> 2025-05-19 10:30:14 CEST [1348336-474] remote_production_user(at)blue DEBUG:
>> 00000: adding a new snapshot and invalidations to 207616976 at FB03/34A1AAE0
>> 2025-05-19 10:30:14 CEST [1348336-475] remote_production_user(at)blue LOCATION:
>> SnapBuildDistributeSnapshotAndInval, snapbuild.c:915
>> 2025-05-19 10:30:14 CEST [1348336-476] remote_production_user(at)blue ERROR:
>> XX000: invalid memory alloc request size 1196493216
>>
>> If I'm reading it right, things go wrong on the publisher while preparing the
>> message, i.e. it's not a subscriber problem.
>>
>
> Right, I also think so.
>
>> This particular instance was triggered by a large number of catalog
>> invalidations: I dumped what I think is the relevant WAL with "pg_waldump -s
>> FB03/34A1AAE0 -p 17/main/ --xid=207637519" and the output was a single long line:
>>
> ...
> ...
>>
>> While it is long, it doesn't seem to merit allocating anything like 1GB of
>> memory. So I'm guessing that postgres is miscalculating the required size somehow.
>>
>
> We fixed a bug in commit 4909b38af0 to distribute invalidations at
> transaction end to avoid data loss in certain cases, and that change could
> cause such a problem. What I am wondering is why it matters with this
> commit: even prior to it, we would eventually end up allocating the
> required memory for all of a transaction's invalidations because of the
> repalloc in ReorderBufferAddInvalidations. One possibility is that we now
> need such allocations for multiple in-progress transactions. I'll think
> more about this. It would be helpful if you could share more details about
> the workload, or if possible, a test case or script with which we can
> reproduce this problem.
>
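
As a rough sanity check on that size (assuming the failing request is for an
array of SharedInvalidationMessage entries, 16 bytes each on a typical 64-bit
build):

echo $(( 1196493216 / 16 ))          # 74780826 messages implied by the request
echo $(( 1196493216 - 1073741823 ))  # positive: the request exceeds MaxAllocSize
                                     # (1 GB - 1), which is what palloc rejects
                                     # with "invalid memory alloc request size"

Tens of millions of messages for a single transaction's invalidations does look
far too large, which would fit the idea that the same invalidations are being
accumulated more than once (e.g. once per in-progress transaction).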
