| From: | Pradeep Kumar <spradeepkumar29(at)gmail(dot)com> | 
|---|---|
| To: | Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> | 
| Subject: | Re: Assertion failure in SnapBuildInitialSnapshot() | 
| Date: | 2025-10-27 12:21:48 | 
| Message-ID: | CAJ4xhP=6h4RrwWpWSaJY4KkrJFCMUZTTGuM53=3wCMqMTBjqKQ@mail.gmail.com | 
| Lists: | pgsql-hackers | 
Hi All,
In this thread
<https://www.postgresql.org/message-id/flat/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le%3DZUWYAYmdfw%40mail.gmail.com>
fix_concurrent_slot_xmin_update.patch was proposed to fix this assertion
failure. After applying that patch, executing pg_sync_replication_slots()
(which calls SyncReplicationSlots() → synchronize_slots() →
synchronize_one_slot() → ReplicationSlotsComputeRequiredXmin(true)) can hit
an assertion failure in ReplicationSlotsComputeRequiredXmin() because
ReplicationSlotControlLock is not held in that code path. By default
sync_replication_slots is off, so the background slot-sync worker is not
spawned; invoking the UDF directly exercises the path without the lock. I
have a small patch that acquires ReplicationSlotControlLock in the manual
sync path; that stops the assert.
Call stack:
TRAP: failed Assert("!already_locked || (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.c", Line: 1061, PID: 67056
0   postgres                            0x000000010104aad4 ExceptionalCondition + 216
1   postgres                            0x0000000100d8718c ReplicationSlotsComputeRequiredXmin + 180
2   postgres                            0x0000000100d6fba8 synchronize_one_slot + 1488
3   postgres                            0x0000000100d6e8cc synchronize_slots + 1480
4   postgres                            0x0000000100d6efe4 SyncReplicationSlots + 164
5   postgres                            0x0000000100d8da84 pg_sync_replication_slots + 476
6   postgres                            0x0000000100b34c58 ExecInterpExpr + 2388
7   postgres                            0x0000000100b33ee8 ExecInterpExprStillValid + 76
8   postgres                            0x00000001008acd5c ExecEvalExprSwitchContext + 64
9   postgres                            0x0000000100b54d48 ExecProject + 76
10  postgres                            0x0000000100b925d4 ExecResult + 312
11  postgres                            0x0000000100b5083c ExecProcNodeFirst + 92
12  postgres                            0x0000000100b48b88 ExecProcNode + 60
13  postgres                            0x0000000100b44410 ExecutePlan + 184
14  postgres                            0x0000000100b442dc standard_ExecutorRun + 644
15  postgres                            0x0000000100b44048 ExecutorRun + 104
16  postgres                            0x0000000100e3053c PortalRunSelect + 308
17  postgres                            0x0000000100e2ff40 PortalRun + 736
18  postgres                            0x0000000100e2b21c exec_simple_query + 1368
19  postgres                            0x0000000100e2a42c PostgresMain + 2508
20  postgres                            0x0000000100e22ce4 BackendInitialize + 0
21  postgres                            0x0000000100d1fd4c postmaster_child_launch + 304
22  postgres                            0x0000000100d26d9c BackendStartup + 448
23  postgres                            0x0000000100d23f18 ServerLoop + 372
24  postgres                            0x0000000100d22f18 PostmasterMain + 6396
25  postgres                            0x0000000100bcffd4 init_locale + 0
26  dyld                                0x0000000186d82b98 start + 6076
The assert is raised inside ReplicationSlotsComputeRequiredXmin() because
that function expects either that already_locked is false (and it will
acquire what it needs), or that callers already hold both
ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In
the manual-sync path called by the UDF, neither lock is held, so the
assertion trips.
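To make the invariant concrete, here is a minimal standalone C sketch of
the condition quoted in the TRAP message. It is a toy model, not PostgreSQL
source: the function name xmin_precondition_holds and its two lock
parameters are hypothetical stand-ins for the already_locked argument and
for the two LWLockHeldByMeInMode() checks.

```c
#include <stdbool.h>

/*
 * Toy model of the asserted condition (hypothetical helper, not a
 * PostgreSQL function). The two "held" flags stand in for
 * LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) and
 * LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE).
 */
static bool
xmin_precondition_holds(bool already_locked,
                        bool slot_control_lock_held_exclusive,
                        bool proc_array_lock_held_exclusive)
{
    /* Either the callee may lock for itself, or the caller holds both. */
    return !already_locked ||
           (slot_control_lock_held_exclusive &&
            proc_array_lock_held_exclusive);
}
```

In this model the manual path corresponds to
xmin_precondition_holds(true, false, false), which evaluates to false; with
already_locked = true the condition only holds once both flags are true.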
Why this happens:
The background slot sync worker (spawned when sync_replication_slots = on)
acquires the necessary locks before calling the routines that
update/compute slot xmins, so the worker path is safe. The manual path
through the SQL-callable UDF does not take the same locks before calling
synchronize_slots()/synchronize_one_slot(). As a result the invariant
assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading
to the assert.
Proposed fix:
In synchronize_slots() (the code path used by
SyncReplicationSlots()/pg_sync_replication_slots()), acquire
ReplicationSlotControlLock before any call that can end up calling
ReplicationSlotsComputeRequiredXmin(true).
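The shape of this change can be sketched with a standalone toy model in C.
This is not the patch itself and not PostgreSQL code: ToyLock,
toy_acquire(), toy_release(), toy_compute_required_xmin() and
toy_sync_one_slot() are hypothetical names. Because the assertion quoted in
the TRAP message also checks ProcArrayLock, the sketch takes both locks
before the call; the acquisition order shown is an assumption of the
sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock with no real concurrency; it only records how it is held. */
typedef enum { TOY_UNLOCKED, TOY_SHARED, TOY_EXCLUSIVE } ToyLockMode;
typedef struct ToyLock { ToyLockMode mode; } ToyLock;

static ToyLock toy_slot_control_lock = { TOY_UNLOCKED };
static ToyLock toy_proc_array_lock = { TOY_UNLOCKED };
static int toy_compute_calls = 0;

static void
toy_acquire(ToyLock *lock, ToyLockMode mode)
{
    assert(lock->mode == TOY_UNLOCKED);   /* no re-acquisition in the model */
    lock->mode = mode;
}

static void
toy_release(ToyLock *lock)
{
    assert(lock->mode != TOY_UNLOCKED);
    lock->mode = TOY_UNLOCKED;
}

/* Same condition as the quoted Assert, expressed on the toy locks. */
static bool
toy_precondition(bool already_locked)
{
    return !already_locked ||
           (toy_slot_control_lock.mode == TOY_EXCLUSIVE &&
            toy_proc_array_lock.mode == TOY_EXCLUSIVE);
}

/* Stand-in for the callee: checks the precondition, then "computes". */
static void
toy_compute_required_xmin(bool already_locked)
{
    assert(toy_precondition(already_locked));
    toy_compute_calls++;
}

/* Stand-in for the manual sync path with the locks taken by the caller. */
static void
toy_sync_one_slot(void)
{
    toy_acquire(&toy_slot_control_lock, TOY_EXCLUSIVE);
    toy_acquire(&toy_proc_array_lock, TOY_EXCLUSIVE);

    toy_compute_required_xmin(true);

    /* Release in reverse order of acquisition. */
    toy_release(&toy_proc_array_lock);
    toy_release(&toy_slot_control_lock);
}
```

Calling toy_sync_one_slot() passes the modeled check, whereas calling
toy_compute_required_xmin(true) with neither toy lock held would trip the
assert, which is the situation the stack trace above shows.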
Thanks and Regards
Pradeep
On Mon, Oct 27, 2025 at 3:09 PM Alexander Lakhin <exclusion(at)gmail(dot)com>
wrote:
> Hello,
>
> 01.02.2024 21:20, vignesh C wrote:
> > The patch which you submitted has been awaiting your attention for
> > quite some time now.  As such, we have moved it to "Returned with
> > Feedback" and removed it from the reviewing queue. Depending on
> > timing, this may be reversible.  Kindly address the feedback you have
> > received, and resubmit the patch to the next CommitFest.
>
> While analyzing buildfarm failures, I found [1], which demonstrates the
> assertion failure discussed here:
> ---
> 031_column_list_publisher.log
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "/home/bf/bf-build/skink/REL_15_STABLE/pgsql.build/../pgsql/src/backend/replication/logical/snapbuild.c", Line: 614, PID: 1882382)
> ---
>
> I've managed to reproduce the assertion failure on REL_15_STABLE with the
> following modification:
> @@ -3928,6 +3928,7 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
>   {
>       Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
>
> +pg_usleep(1000);
>       if (!already_locked)
>           LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>
> using the script:
> numjobs=100
> createdb db
> export PGDATABASE=db
>
> for ((i=1;i<=100;i++)); do
> echo "iteration $i"
>
> for ((j=1;j<=numjobs;j++)); do
> echo "
> SELECT pg_create_logical_replication_slot('s$j', 'test_decoding');
> SELECT txid_current();
> " | psql >>/dev/null 2>&1 &
>
> echo "
> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
> CREATE_REPLICATION_SLOT slot$j LOGICAL test_decoding USE_SNAPSHOT;
> " | psql -d "dbname=db replication=database" >>/dev/null 2>&1 &
> done
> wait
>
> for ((j=1;j<=numjobs;j++)); do
> echo "
> DROP_REPLICATION_SLOT slot$j;
> " | psql -d "dbname=db replication=database" >/dev/null
>
> echo "SELECT pg_drop_replication_slot('s$j');" | psql >/dev/null
> done
>
> grep 'TRAP' server.log && break;
> done
>
> (with
> wal_level = logical
> max_replication_slots = 200
> max_wal_senders = 200
> in postgresql.conf)
>
> iteration 18
> ERROR:  replication slot "slot13" is active for PID 538431
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 614, PID: 538431)
>
>
> I've also confirmed that fix_concurrent_slot_xmin_update.patch fixes the
> issue.
>
> [1]
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-05-15%2020%3A55%3A17
>
> Best regards,
> Alexander