Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Duncan Sands <duncan(dot)sands(at)deepbluecap(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5
Date: 2025-06-03 03:38:09
Message-ID: CAA4eK1JDogUOS7yx3yYL2VycWQ7THsfvmq+AnxjZx7YTOss4Mw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Jun 2, 2025 at 10:21 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sun, Jun 1, 2025 at 8:13 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Sat, 31 May 2025 at 10:20, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > The concurrent abort handling is done for streaming and prepared
> > > transactions, where we send the transaction changes to the client
> > > before we read COMMIT/COMMIT PREPARED/ROLLBACK/ROLLBACK PREPARED. Now,
> > > among these COMMIT/ROLLBACK PREPARED cases are handled as part of a
> > > prepared transaction case. For ROLLBACK, we will never perform any
> > > changes from the current transaction, so we don't need distributed
> > > invalidations to be executed. For COMMIT, if we encounter any errors
> > > while processing changes (this is when we reach the ERROR path, which
> > > is not a concurrent abort), then we will reprocess all changes and, at
> > > the end, execute both the current transaction and distributed
> > > invalidations. Now, one possibility is that if, after ERROR, the
> > > caller does slot_advance to skip the ERROR, then we will probably miss
> > > executing the distributed invalidations, leading to data loss
> > > afterwards. If the above theory is correct, then it is better to
> > > execute distributed invalidation even in non-concurrent-abort cases in
> > > the ERROR path.
>
> This theory seems to be correct to me. If users re-try the logical
> decoding with smaller logical_decoding_work_mem the process could end
> up streaming the transaction with the wrong relcache before sending
> the committed transaction?
>

But will it give a problem? I think if a different transaction is
chosen for streaming after restart, then it should also have
distributed invalidation change queued for it at the appropriate
place, so that helps in making sure the cache is appropriate.

> > One possible reason this scenario may not occur is that
> > pg_logical_slot_get_changes_guts uses a PG_CATCH block to handle
> > exceptions, during which it calls InvalidateSystemCaches to clear the
> > system cache. Because of this, I believe the scenario might not
> > actually happen.
> > @Sawada-san / others — Are there any other cases where this could still occur?
>
> I think that we cannot confine use cases to walsender and
> pg_logical_slot_get_changes_guts() given that we expose logical
> decoding API at C level.
>

Hmm, yeah, that could be risky. Also, if we have to rely on the
InvalidateSystemCaches performed by the caller, then we don't even
need to execute the non-distributed invalidations. I think for the
sake of consistency and for the reason you mentioned, we can execute
distributed invalidations in the non-concurrent-abort ERROR path.

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2025-06-03 04:18:13 Re: BUG #18943: Return value of a function 'xmlBufferCreate' is dereferenced at xpath.c:177 without checking for NUL
Previous Message Amit Kapila 2025-06-03 03:09:29 Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5