Fix race during concurrent logical decoding activation

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Fix race during concurrent logical decoding activation
Date: 2026-05-28 09:09:13
Message-ID: 788B5B8A-BC22-48D8-818E-7B00416CF84E@gmail.com
Lists: pgsql-hackers

Hi,

While testing “Toggle logical decoding dynamically based on logical slot presence”, I hit an assertion failure with concurrent logical slot creation.

Here is a repro:

1. In session 1, attach the injection point locally and start creating a logical slot. The session blocks at logical-decoding-activation:
```
evantest=# set application_name = 'slot_a';
SET
evantest=# select injection_points_set_local();
injection_points_set_local
----------------------------

(1 row)
evantest=# select injection_points_attach('logical-decoding-activation', 'wait');
injection_points_attach
-------------------------

(1 row)
evantest=# select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```

2. In session 2, create another logical slot. This succeeds, and effective_wal_level becomes logical:
```
evantest=# select pg_create_logical_replication_slot('slot_b', 'pgoutput');
pg_create_logical_replication_slot
------------------------------------
(slot_b,0/0902E418)
(1 row)

evantest=# show effective_wal_level;
effective_wal_level
---------------------
logical
(1 row)
```

3. In session 2, cancel session 1 instead of waking it up:
```
evantest=# select pg_cancel_backend(pid) from pg_stat_activity where application_name = 'slot_a';
pg_cancel_backend
-------------------
t
(1 row)
```

Then the server hits this assertion:
```
TRAP: failed Assert("!LogicalDecodingCtl->logical_decoding_enabled"), File: "logicalctl.c", Line: 266, PID: 13768
0 postgres 0x00000001032b35d8 ExceptionalCondition + 216
1 postgres 0x0000000102f64600 abort_logical_decoding_activation + 120
2 postgres 0x0000000102f6451c EnsureLogicalDecodingEnabled + 412
3 postgres 0x0000000102f9f314 create_logical_replication_slot + 164
4 postgres 0x0000000102f9f1c4 pg_create_logical_replication_slot + 312
5 postgres 0x0000000102ce5f48 ExecInterpExpr + 3888
6 postgres 0x0000000102ce48b4 ExecInterpExprStillValid + 76
7 postgres 0x0000000102d57e94 ExecEvalExprNoReturn + 44
8 postgres 0x0000000102d57e54 ExecEvalExprNoReturnSwitchContext + 48
9 postgres 0x0000000102d57d18 ExecProject + 72
10 postgres 0x0000000102d57a9c ExecResult + 312
11 postgres 0x0000000102d06f1c ExecProcNodeFirst + 92
12 postgres 0x0000000102cfd8cc ExecProcNode + 60
13 postgres 0x0000000102cf83fc ExecutePlan + 244
14 postgres 0x0000000102cf8298 standard_ExecutorRun + 456
15 postgres 0x0000000102cf80c0 ExecutorRun + 84
16 postgres 0x000000010306fc64 PortalRunSelect + 296
17 postgres 0x000000010306f674 PortalRun + 656
18 postgres 0x000000010306a220 exec_simple_query + 1372
19 postgres 0x0000000103069348 PostgresMain + 3224
20 postgres 0x0000000103060a3c BackendInitialize + 0
21 postgres 0x0000000102f27db8 postmaster_child_launch + 464
22 postgres 0x0000000102f2f2ec BackendStartup + 304
23 postgres 0x0000000102f2d260 ServerLoop + 372
24 postgres 0x0000000102f2bd8c PostmasterMain + 6256
25 postgres 0x0000000102d99e84 main + 924
26 dyld 0x000000018cef7e00 start + 6992
2026-05-28 13:28:32.526 CST [13753] LOG: client backend (PID 13768) was terminated by signal 6: Abort trap: 6
2026-05-28 13:28:32.526 CST [13753] DETAIL: Failed process was running: select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```

From my tracing, when session 1 is cancelled, it enters abort_logical_decoding_activation(), which contains this assertion:
```
Assert(!LogicalDecodingCtl->logical_decoding_enabled);
```

But by then session 2 has already created its slot and set LogicalDecodingCtl->logical_decoding_enabled to true, so the assertion no longer holds. This is a race condition.
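
To make the interleaving concrete, here is a minimal single-threaded model of the repro. It is only an illustration, not code from logicalctl.c: ModelCtl and the model_* functions are hypothetical names, and the only state modeled is the logical_decoding_enabled flag named in the assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared control struct (assumption). */
typedef struct
{
    bool logical_decoding_enabled;
} ModelCtl;

/* Session 2's path: its activation runs to completion and sets the flag. */
static void
model_complete_activation(ModelCtl *ctl)
{
    ctl->logical_decoding_enabled = true;
}

/*
 * Session 1's cancel path: reports whether the precondition asserted in
 * abort_logical_decoding_activation() would hold.
 */
static bool
model_abort_precondition_holds(const ModelCtl *ctl)
{
    return !ctl->logical_decoding_enabled;
}

/* Replays the three repro steps; returns whether the assertion would hold. */
static bool
model_replay_race(void)
{
    ModelCtl ctl = { false };

    /* Step 1: session 1 starts activation and blocks; flag is still false. */
    assert(model_abort_precondition_holds(&ctl));

    /* Step 2: session 2 creates slot_b and completes its own activation. */
    model_complete_activation(&ctl);

    /* Step 3: session 1 is cancelled and runs its abort path. */
    return model_abort_precondition_holds(&ctl);
}
```

In this model, model_replay_race() returns false, which corresponds to the TRAP above.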

I might be overthinking this, but I feel the safest fix is to serialize EnableLogicalDecoding(). I tried serializing with LogicalDecodingControlLock and with a separate lock, but both approaches deadlocked around the barrier wait. I ended up adding an activation_in_progress flag in shared memory, protected by LogicalDecodingControlLock, together with a condition variable that other backends use to wait for the in-progress activation to finish.
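
The general shape of that approach can be sketched as follows. This is a standalone model under stated assumptions, not the attached patch: pthread primitives stand in for LogicalDecodingControlLock and the PostgreSQL condition variable, and ActivationModel, activation_model_init(), activation_begin(), and activation_end() are hypothetical names. The early return when decoding is already enabled is also an assumption of this sketch. The point it shows is that the lock is held only while reading or changing the flags, never across the slow activation work, so waiters sleep on the condition variable instead of on the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the shared state (assumption, not the patch). */
typedef struct
{
    pthread_mutex_t lock;                    /* stands in for LogicalDecodingControlLock */
    pthread_cond_t  cv;                      /* stands in for the condition variable */
    bool            activation_in_progress;  /* true while one backend is activating */
    bool            logical_decoding_enabled;
} ActivationModel;

static void
activation_model_init(ActivationModel *m)
{
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->cv, NULL);
    m->activation_in_progress = false;
    m->logical_decoding_enabled = false;
}

/*
 * Wait until no activation is in progress. Returns true if the caller is
 * now the single activator, false if decoding is already enabled.
 */
static bool
activation_begin(ActivationModel *m)
{
    bool owner;

    pthread_mutex_lock(&m->lock);
    while (m->activation_in_progress)
        pthread_cond_wait(&m->cv, &m->lock);    /* releases the lock while asleep */
    owner = !m->logical_decoding_enabled;
    if (owner)
        m->activation_in_progress = true;
    pthread_mutex_unlock(&m->lock);
    return owner;
}

/*
 * Called by the activator on success (true) or on abort/cancel (false).
 * Because activators are serialized, the flag cannot have been set by
 * anyone else, so the abort-path precondition holds.
 */
static void
activation_end(ActivationModel *m, bool success)
{
    pthread_mutex_lock(&m->lock);
    assert(m->activation_in_progress);
    assert(!m->logical_decoding_enabled);
    m->logical_decoding_enabled = success;
    m->activation_in_progress = false;
    pthread_cond_broadcast(&m->cv);             /* wake every waiting backend */
    pthread_mutex_unlock(&m->lock);
}
```

In this model, a cancelled activator calls activation_end(&m, false); a waiter then wakes up, becomes the activator itself, and finishes with activation_end(&m, true).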

With this fix, rerunning the repro makes session 2 wait while session 1 is blocked at the injection point. After canceling session 1 from session 3, session 2 continues, creates the slot successfully, and effective_wal_level becomes logical.

I didn’t include a test in this patch, as I wasn’t sure such a test would be desirable. If others think it is worth adding, I can convert the repro into a TAP test.

See the attached patch for details.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachment Content-Type Size
v1-0001-Fix-race-during-concurrent-logical-decoding-activ.patch application/octet-stream 9.6 KB
