Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Duncan Sands <duncan(dot)sands(at)deepbluecap(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5
Date: 2025-05-23 03:00:35
Message-ID: CAA4eK1L7CA-A=VMn8fiugZ+CRt+wz473Adrx3nxq8Ougu=O2kQ@mail.gmail.com
Lists: pgsql-bugs

On Thu, May 22, 2025 at 6:29 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Amit, Sawada-san,
>
> > Good point. After replaying the transaction, it doesn't matter because
> > we would have already relayed the required invalidation while
> > processing REORDER_BUFFER_CHANGE_INVALIDATION messages. However, for
> > the concurrent abort case it could matter. See my analysis below:
> >
> > Simulation of concurrent abort
> > ------------------------------------------
> > 1) S1: CREATE TABLE d(data text not null);
> > 2) S1: INSERT INTO d VALUES('d1');
> > 3) S2: BEGIN; INSERT INTO d VALUES('d2');
> > 4) S2: INSERT INTO unrelated_tab VALUES(1);
> > 5) S1: ALTER PUBLICATION pb ADD TABLE d;
> > 6) S2: INSERT INTO unrelated_tab VALUES(2);
> > 7) S2: ROLLBACK;
> > 8) S2: INSERT INTO d VALUES('d3');
> > 9) S1: INSERT INTO d VALUES('d4');
>
> > The problem with the sequence is that the insert from 3) could be
> > decoded *after* 5) in step 6) due to streaming and that to decode the
> > insert (which happened before the ALTER) the catalog snapshot and
> > cache state is from *before* the ALTER TABLE. Because the transaction
> > started in 3) doesn't actually modify any catalogs, no invalidations
> > are executed after decoding it. Now, assume, while decoding Insert
> > from 4), we detected a concurrent abort, then the distributed
> > invalidation won't be executed, and if we don't have accumulated
> > messages in txn->invalidations, then the invalidation from step 5)
> > won't be performed. The data loss can occur in steps 8 and 9. This is
> > just a theory, so I could be missing something.
>
> I checked whether this is a real problem and was able to reproduce it. See the
> appendix for the detailed steps.
>
> > If the above turns out to be a problem, one idea for fixing it is that
> > for the concurrent abort case (both during streaming and for prepared
> > transaction's processing), we still check all the remaining changes
> > and process only the changes related to invalidations. This has to be
> > done before the current txn changes are freed via
> > ReorderBufferResetTXN->ReorderBufferTruncateTXN.
>
> I roughly implemented that part; PSA the updated version. One concern is whether we
> should consider the case where executing invalidations raises ereport(ERROR). If that
> happens, the walsender will exit at that point.
>

But in the catch part, we are already executing invalidations:
...
/* make sure there's no cache pollution */
ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
...

So, the behaviour should be the same.
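
To make the idea quoted above concrete (on a concurrent abort, keep walking the
remaining changes but process only the invalidation ones before the changes are
freed), here is a minimal toy model in C. It is only a sketch under stated
assumptions: Change, Txn, cache_valid, execute_invalidation and replay_txn are
hypothetical stand-ins invented for illustration and are not the real
reorderbuffer.c structures or functions.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real PostgreSQL structs. */
typedef enum
{
    CHANGE_INSERT,          /* a data change */
    CHANGE_INVALIDATION     /* a queued cache invalidation */
} ChangeKind;

typedef struct Change
{
    ChangeKind     kind;
    int            inval_id;    /* toy payload: cache slot to invalidate */
    struct Change *next;
} Change;

typedef struct
{
    Change *changes;              /* queued changes, in decode order */
    int     concurrent_abort_at;  /* index where the abort is noticed, -1 if never */
} Txn;

#define CACHE_SLOTS 8
static int cache_valid[CACHE_SLOTS];    /* 1 = entry cached, 0 = invalidated */

/* Toy equivalent of executing one invalidation message. */
static void
execute_invalidation(int id)
{
    if (id >= 0 && id < CACHE_SLOTS)
        cache_valid[id] = 0;
}

/*
 * Replay a transaction. Once a concurrent abort is noticed, data changes
 * are skipped, but the loop still visits the remaining changes so that
 * every invalidation is executed. Returns the number of data changes
 * that were applied before the abort was noticed.
 */
static int
replay_txn(const Txn *txn)
{
    int applied = 0;
    int aborted = 0;
    int idx = 0;

    for (const Change *c = txn->changes; c != NULL; c = c->next, idx++)
    {
        if (!aborted && idx == txn->concurrent_abort_at)
            aborted = 1;

        if (c->kind == CHANGE_INVALIDATION)
            execute_invalidation(c->inval_id);  /* always processed */
        else if (!aborted)
            applied++;                          /* data change, only pre-abort */
    }
    return applied;
}
```

In this model, a transaction whose queue is "insert, insert (abort noticed
here), invalidation, insert" applies only the first insert, yet the cache slot
named by the invalidation still ends up invalidated, which is the property the
proposal is after.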

--
With Regards,
Amit Kapila.
