Re: Invalid pointer access in logical decoding after error

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Invalid pointer access in logical decoding after error
Date: 2025-09-29 08:52:29
Message-ID: CALDaNm2RPmNpyhiLj4WRn70izxnM7Cbmo53HUffZtg72MwZFCQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > I encountered an invalid pointer access issue. Below are the steps to
> > > > reproduce the issue:
> > > ...
> > > > The error occurs because entry->columns is allocated in the entry
> > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
> > > > context is a child of the PortalContext, which is cleared after an
> > > > error via: AbortTransaction
> > > > -> AtAbort_Portals ->
> > > > MemoryContextDeleteChildren -> MemoryContextDelete ->
> > > > MemoryContextDeleteOnly
> > > > As a result, the memory backing entry->columns is freed, but the
> > > > RelationSyncCache which resides in CacheMemoryContext and thus
> > > > survives the error still holds a dangling pointer to this freed
> > > > memory, causing it to pfree an invalid pointer.
> > > > In the normal (positive) execution flow, pgoutput_shutdown() is called
> > > > to clean up the RelationSyncCache. This happens via:
> > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
> > > > this is not called in case of an error case. To handle this case
> > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
> > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> > > > Attached patch has the changes for the same.
> > > > Thoughts?
> > >
> > > Thank you for reporting the issue and providing a fix.
> > >
> > > I recall that we identified this general issue with the hash table in pgoutput
> > > in other threads as well [1]. The basic consensus [2] is that calling
> > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
> > > user code, increasing the risk of encountering another error within PG_CATCH.
> > > This scenario could prevent execution of subsequent code to invalidate syscache
> > > entries, which is problematic.
> >
> > Yes, let's avoid this.
> >
> > > I think a better fix could be to introduce a memory context reset callback(on
> > > data->cachectx) and perform the actions of pgoutput_shutdown() within it.
> >
> > The attached v2 version patch has the changes for the same.
>
> We've addressed several memory-related issues in pgoutput. While most
> of these issues didn't affect logical replication, they did impact
> logical decoding called via SQL API. I find that these problems stem
> from RelationSyncCache being defined as a file-scope static variable
> and being allocated in CacheMemoryContext. I'm wondering if we could
> move it to PGOutputData and create it under the logical decoding
> context. This would ensure it's automatically cleaned up along with
> the logical decoding context.

I see one issue with keeping RelationSyncCache inside PGOutputData.
Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely
on this cache: they check its validity and update it in response to
invalidation events. The problem is that these callbacks cannot be
unregistered once they are registered. After pgoutput_shutdown() runs,
the hash table is destroyed and later FreeDecodingContext deletes the
logical decoding context. At that point, the registered callbacks may
still fire, but they’ll be holding onto a freed memory.

This is also noted directly in the callbacks:
/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)
return;

So the core issue is that the callbacks may outlive the cache they
reference, and we don’t have a mechanism to unregister them cleanly.

Thoughts?

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2025-09-29 09:21:04 Re: bug, ALTER TABLE call ATPostAlterTypeCleanup twice for the same relation
Previous Message Richard Guo 2025-09-29 08:52:18 Re: plan shape work