Re: Fix race in ReplicationSlotRelease for ephemeral slots

From: Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fix race in ReplicationSlotRelease for ephemeral slots
Date: 2026-05-29 16:44:10
Message-ID: CAFC+b6o-hD5VxVLZQovmHSYykF8Qzq3eiuBU-U1F_yR9-y6P_w@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, May 27, 2026 at 5:20 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote:

> Hi,
>
> While testing the slot release logic, I noticed a bug in
> ReplicationSlotRelease(): it may access a replication slot array entry
> that it has already released itself.
>
> In detail: when releasing an ephemeral replication slot,
> ReplicationSlotRelease() first drops the slot via
> ReplicationSlotDropAcquired(). After this point, the slot's entry in the
> shared-memory slot array can be immediately reused by another backend
> creating a new slot.
>
> However, ReplicationSlotRelease() continued executing common cleanup code
> that still dereferenced the old slot pointer and updated shared-memory
> fields such as effective_xmin. If the slot array entry had already been
> reallocated, these writes could inadvertently affect a different,
> unrelated slot.
>
> I am attaching a patch that avoids touching the slot's shared-memory state
> after dropping an ephemeral slot. It keeps the post-release shared-memory
> updates only for non-ephemeral slots, where the slot remains valid after
> release.
>
> To reproduce, we can use the following steps:
>
> 1. Attach gdb to the backend and set a breakpoint in
>    ReplicationSlotRelease() right after ReplicationSlotDropAcquired() is
>    called.
> 2. Create an ephemeral slot in the above backend with an invalid output
>    plugin:
>    SELECT pg_create_logical_replication_slot('test_slot_dropped', 'pgoutput2', false, false, true);
> 3. Once the breakpoint is hit, start another backend and create a new slot
>    named 'test_slot_created'.
> 4. Continue past the breakpoint and let the first backend run. At this
>    point, you will see it updating active_proc (and effective_xmin, if a
>    snapshot is being exported) of the new slot 'test_slot_created' to
>    invalid values.
> 5. Start a third backend and attempt to acquire the same slot
>    'test_slot_created'. This should not be possible under normal
>    circumstances, but the bug allows it.
>

patch LGTM.
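
For anyone skimming the thread, the shape of the bug can be sketched as a toy, standalone C model. This is not PostgreSQL code: Slot, slot_create(), slot_drop(), release_buggy(), release_fixed() and run_scenario() are hypothetical names, the "other backend" is simulated by a plain function call at the point where the gdb breakpoint would sit, and only the ordering of drop, reuse and cleanup is meant to match the description above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOTS 1			/* a single entry forces immediate reuse */

/* Toy stand-in for a shared-memory slot array entry (assumed layout). */
typedef struct Slot
{
	bool		in_use;
	bool		ephemeral;
	int			active_proc;	/* 0 = no owner, a convention of this toy model */
	char		name[64];
} Slot;

static Slot slots[MAX_SLOTS];

/* Claim the first free array entry, as a backend creating a slot would. */
static Slot *
slot_create(const char *name, bool ephemeral, int proc)
{
	for (int i = 0; i < MAX_SLOTS; i++)
	{
		if (!slots[i].in_use)
		{
			slots[i].in_use = true;
			slots[i].ephemeral = ephemeral;
			slots[i].active_proc = proc;
			snprintf(slots[i].name, sizeof(slots[i].name), "%s", name);
			return &slots[i];
		}
	}
	return NULL;
}

/* Free the entry; from here on any other backend may reuse it. */
static void
slot_drop(Slot *s)
{
	s->in_use = false;
}

/* Simulates a second backend creating a slot while the first is paused. */
static void
other_backend_creates_slot(void)
{
	(void) slot_create("test_slot_created", false, 42);
}

/* Buggy shape: common cleanup runs even after the entry was dropped. */
static void
release_buggy(Slot *s)
{
	if (s->ephemeral)
	{
		slot_drop(s);
		other_backend_creates_slot();	/* window where the entry is reused */
	}
	s->active_proc = 0;		/* stale write: may hit the new slot */
}

/* Fixed shape: shared-state cleanup only when the entry is still ours. */
static void
release_fixed(Slot *s)
{
	if (s->ephemeral)
	{
		slot_drop(s);
		other_backend_creates_slot();
	}
	else
		s->active_proc = 0;
}

/* Returns active_proc of the reused entry after the first backend releases. */
int
run_scenario(bool use_fix)
{
	Slot	   *s;

	memset(slots, 0, sizeof(slots));
	s = slot_create("test_slot_dropped", true, 7);
	if (s == NULL)
		return -1;
	if (use_fix)
		release_fixed(s);
	else
		release_buggy(s);
	return slots[0].active_proc;
}
```

In this model run_scenario(false) returns 0, meaning the new slot's owner was wiped by the stale write, while run_scenario(true) returns 42, meaning the new slot was left alone.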

>
> I haven't attached a test for this fix, as the change is straightforward
> and the likelihood of encountering this bug is low, so it may not be worth
> adding test cycles for it. However, if others feel differently, I'm OK to
> add one.
>

+1 for a test. The fix is just an else, so a future refactor could change it
and silently reintroduce the corruption. Because the stray write lands on an
unrelated, reused slot, nothing would catch it. Injection points make the
race deterministic; I've attached a patch that adds a test that fails
without the fix and passes with it.

--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Attachment            Content-Type                Size
nocfbot-test.patch    application/octet-stream    4.3 KB
