| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
| Cc: | Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: walsender: Assert MyReplicationSlot is set before use |
| Date: | 2026-02-03 03:37:36 |
| Message-ID: | 9A9121BE-C7E1-4DC8-9D29-FCA73BBDB703@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Feb 2, 2026, at 15:37, Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Monday, February 2, 2026 3:16 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>
>>> On Feb 2, 2026, at 14:43, Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
>> wrote:
>>>
>>> On Monday, February 2, 2026 2:25 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>>
>>>>
>>>> Another related issue is in StartLogicalReplication(): the function
>>>> initially asserts MyReplicationSlot == NULL, then calls
>>>> ReplicationSlotAcquire(), which is expected to set up
>>>> MyReplicationSlot. Since MyReplicationSlot is dereferenced later in
>>>> this function, I think it would be clearer and safer to add an
>>>> Assert(MyReplicationSlot != NULL) immediately after the call to
>> ReplicationSlotAcquire().
>>>
>>> I don't think the suggested Assert is unnecessary, as the primary
>>> function of
>>> ReplicationSlotAcquire() is to get MyReplicationSlot assigned a valid
>>> value. Any issues that arise should be handled within the function
>>> itself. I think an Assert placed after this function does not add additional
>> safety.
>>>
>>
>> Yes, ReplicationSlotAcquire() has a retry logic to ensure MyReplicationSlot to
>> be set. So adding a NULL check here is a bit redundant. How about
>> Assert(SlotIsLogical(MyReplicationSlot))? Does it make more sense?
>
> I think we cannot assume the slot type here. A suitable checking might
> be: If a physical slot was acquired during logical replication, report an error,
> just like we do in StartReplication().
Good point. In StartReplication(), we check MyReplicationSlot is not logical, correspondingly, in StartLogicalReplication(), we should check MyReplicationSlot is not physical.
>
> I haven't verified whether this error will occur in practice, but I think we
> could reproduce this by dropping the original logical slot and then creating a
> new physical slot with the same name concurrently.
>
Without sufficient load, this race is hard to trigger manually, so I added a TAP test using an injection point. With the injection point in place, the TAP test can hit the error:
```
# +++ tap check in src/test/recovery +++
t/052_logical_slot_type_race.pl .. ok
All tests successful.
Files=1, Tests=2, 3 wallclock secs ( 0.00 usr 0.00 sys + 0.11 cusr 0.36 csys = 0.47 CPU)
Result: PASS
```
I think this test demonstrates that adding a sanity check in StartLogicalReplication() is necessary.
I’ve put the TAP script in 0002 and will leave it to the committer to decide whether to accept the test.
PFA v2:
* 0001 changes the Assert to explicitly verify that MyReplicationSlot is a logical slot.
* 0002 adds a TAP test covering a race where a logical slot is dropped while a physical slot with the same name is created concurrently. If 0002 is accepted, I’m fine with squashing it into 0001.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-walsender-Assert-MyReplicationSlot-is-set-before-.patch | application/octet-stream | 1.4 KB |
| v2-0002-Add-TAP-test-for-logical-slot-type-race.patch | application/octet-stream | 4.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chao Li | 2026-02-03 03:46:03 | Re: pg_resetwal: Fix wrong directory in log output |
| Previous Message | Corey Huinker | 2026-02-03 03:34:14 | Re: Add expressions to pg_restore_extended_stats() |