| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
| Subject: | Fix race during concurrent logical decoding activation |
| Date: | 2026-05-28 09:09:13 |
| Message-ID: | 788B5B8A-BC22-48D8-818E-7B00416CF84E@gmail.com |
| Lists: | pgsql-hackers |
Hi,
While testing “Toggle logical decoding dynamically based on logical slot presence”, I hit an assertion failure with concurrent logical slot creation.
Here is a repro:
1. In session 1, attach the injection point locally and start creating a logical slot. The session blocks at logical-decoding-activation:
```
evantest=# set application_name = 'slot_a';
SET
evantest=# select injection_points_set_local();
injection_points_set_local
----------------------------
(1 row)
evantest=# select injection_points_attach('logical-decoding-activation', 'wait');
injection_points_attach
-------------------------
(1 row)
evantest=# select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```
2. In session 2, create another logical slot. This succeeds, and effective_wal_level becomes logical:
```
evantest=# select pg_create_logical_replication_slot('slot_b', 'pgoutput');
pg_create_logical_replication_slot
------------------------------------
(slot_b,0/0902E418)
(1 row)
evantest=# show effective_wal_level;
effective_wal_level
---------------------
logical
(1 row)
```
3. In session 2, cancel session 1 instead of waking it up:
```
evantest=# select pg_cancel_backend(pid) from pg_stat_activity where application_name = 'slot_a';
pg_cancel_backend
-------------------
t
(1 row)
```
Then the server hits this assertion:
```
TRAP: failed Assert("!LogicalDecodingCtl->logical_decoding_enabled"), File: "logicalctl.c", Line: 266, PID: 13768
0 postgres 0x00000001032b35d8 ExceptionalCondition + 216
1 postgres 0x0000000102f64600 abort_logical_decoding_activation + 120
2 postgres 0x0000000102f6451c EnsureLogicalDecodingEnabled + 412
3 postgres 0x0000000102f9f314 create_logical_replication_slot + 164
4 postgres 0x0000000102f9f1c4 pg_create_logical_replication_slot + 312
5 postgres 0x0000000102ce5f48 ExecInterpExpr + 3888
6 postgres 0x0000000102ce48b4 ExecInterpExprStillValid + 76
7 postgres 0x0000000102d57e94 ExecEvalExprNoReturn + 44
8 postgres 0x0000000102d57e54 ExecEvalExprNoReturnSwitchContext + 48
9 postgres 0x0000000102d57d18 ExecProject + 72
10 postgres 0x0000000102d57a9c ExecResult + 312
11 postgres 0x0000000102d06f1c ExecProcNodeFirst + 92
12 postgres 0x0000000102cfd8cc ExecProcNode + 60
13 postgres 0x0000000102cf83fc ExecutePlan + 244
14 postgres 0x0000000102cf8298 standard_ExecutorRun + 456
15 postgres 0x0000000102cf80c0 ExecutorRun + 84
16 postgres 0x000000010306fc64 PortalRunSelect + 296
17 postgres 0x000000010306f674 PortalRun + 656
18 postgres 0x000000010306a220 exec_simple_query + 1372
19 postgres 0x0000000103069348 PostgresMain + 3224
20 postgres 0x0000000103060a3c BackendInitialize + 0
21 postgres 0x0000000102f27db8 postmaster_child_launch + 464
22 postgres 0x0000000102f2f2ec BackendStartup + 304
23 postgres 0x0000000102f2d260 ServerLoop + 372
24 postgres 0x0000000102f2bd8c PostmasterMain + 6256
25 postgres 0x0000000102d99e84 main + 924
26 dyld 0x000000018cef7e00 start + 6992
2026-05-28 13:28:32.526 CST [13753] LOG: client backend (PID 13768) was terminated by signal 6: Abort trap: 6
2026-05-28 13:28:32.526 CST [13753] DETAIL: Failed process was running: select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```
From my tracing, when session 1 is cancelled, it enters abort_logical_decoding_activation(), which contains this assertion:
```
Assert(!LogicalDecodingCtl->logical_decoding_enabled);
```
But by then session 2 had already created its slot and set LogicalDecodingCtl->logical_decoding_enabled to true, so the assertion fails. This is a race condition.
I might be overthinking this, but I feel the safest fix is to serialize EnableLogicalDecoding(). I tried serializing with LogicalDecodingControlLock and with a separate lock, but both approaches deadlocked around the barrier wait. I ended up adding an activation_in_progress flag in shared memory, protected by LogicalDecodingControlLock, plus a condition variable on which other backends wait for the in-progress activation to finish.
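To illustrate the shape of the flag-plus-condition-variable approach, here is a minimal standalone sketch using POSIX threads rather than PostgreSQL's LWLock and ConditionVariable facilities. It is not the attached patch: the ActivationCtl struct and the begin_activation()/end_activation() helpers are hypothetical names, and only activation_in_progress and logical_decoding_enabled are taken from the description above. The sketch assumes the lock is released before the slow part of activation (such as the barrier wait), so waiters block on the condition variable rather than on the lock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory control struct. */
typedef struct ActivationCtl
{
    pthread_mutex_t lock;   /* plays the role of LogicalDecodingControlLock */
    pthread_cond_t  cv;     /* signalled when an activation finishes */
    bool            activation_in_progress;
    bool            logical_decoding_enabled;
} ActivationCtl;

#define ACTIVATION_CTL_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false }

/*
 * Wait until no other activation is running, then claim the activation if
 * decoding is still disabled.  Returns true if the caller must perform the
 * activation, false if it is already enabled.  The lock is not held on
 * return, so the caller can do slow work without blocking lock holders.
 */
bool
begin_activation(ActivationCtl *ctl)
{
    bool must_activate;

    pthread_mutex_lock(&ctl->lock);
    while (ctl->activation_in_progress)
        pthread_cond_wait(&ctl->cv, &ctl->lock);

    must_activate = !ctl->logical_decoding_enabled;
    if (must_activate)
        ctl->activation_in_progress = true;
    pthread_mutex_unlock(&ctl->lock);

    return must_activate;
}

/*
 * Finish an activation claimed by begin_activation().  On success the
 * enabled flag is set; on abort (for example, a cancelled backend) it is
 * left untouched.  Either way the claim is released and waiters are woken.
 */
void
end_activation(ActivationCtl *ctl, bool success)
{
    pthread_mutex_lock(&ctl->lock);
    if (success)
        ctl->logical_decoding_enabled = true;
    ctl->activation_in_progress = false;
    pthread_cond_broadcast(&ctl->cv);
    pthread_mutex_unlock(&ctl->lock);
}
```

In this sketch only one caller at a time can be between begin_activation() and end_activation(), so an aborting caller can never observe the enabled flag set by a concurrent activation. A second caller that arrives during an activation sleeps on the condition variable and re-checks the state once it is woken.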
With this fix, rerunning the repro makes session 2 wait while session 1 is blocked at the injection point. After canceling session 1 from session 3, session 2 continues, creates the slot successfully, and effective_wal_level becomes logical.
I didn’t include a test in this patch, as I wasn’t sure such a test would be desirable. If others think it is worth adding, I can convert the repro into a TAP test.
See the attached patch for details.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Fix-race-during-concurrent-logical-decoding-activ.patch | application/octet-stream | 9.6 KB |