| From: | Pradeep Kumar <spradeepkumar29(at)gmail(dot)com> | 
|---|---|
| To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> | 
| Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> | 
| Subject: | Re: Assertion failure in SnapBuildInitialSnapshot() | 
| Date: | 2025-10-30 07:12:41 | 
| Message-ID: | CAJ4xhP=RBbHBjj1-wOLKr=ido8iH=oWVpc0y35EN7F0xx3ZWqA@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hi ,
Thanks for reviewing this issue, After applying the patch
fix_concurrent_slot_xmin_update that you have submitted before [1]  on
REL_18_STABLE, Here are the below steps to reproduce this issue manually
Steps :
1) <postgres_bin_directory>/initdb -D primary
2)  echo "wal_level=logical" > primary/postgresql.conf (Edit the
postgresql.conf to add wal_level = logical)
3)  echo "hot_standby_feeback = on" > primary/postgresql.conf (Edit the
postgresql.conf to hot_standby_feedback = on)
4) Start the postgres server (Primary)
5) Connect Primary  => <postgres_bin_directory>/psql postgres
6) Execute UDF "SELECT pg_create_physical_replication_slot('standby_slot');
" on primary
7) Execute UDF "SELECT
pg_create_logical_replication_slot('test_logical_slot', 'pgoutput', false,
false, true); " on primary
8) <postgres_bin_directory>/pg_basebackup -h localhost -p 5432 -D 'standby'
-R (get a basebackup of primary to attach as replica to the primary)
9) Edit standby/postgresql.conf as "wal_level=logical" ,
"primary_slot_name=standby_slot"
10) Edit standby/postgresql.auto.conf  as "dbname=postgres" in
primary_conninfo GUC
11) Start the standby server
12) execute UDF "SELECT pg_sync_replication_slots();" => this will leads to
assert failure
Here I attached the updated patch to solve this issue.
Thanks and Regards
Pradeep
On Thu, Oct 30, 2025 at 4:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
> On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>
> wrote:
> >
> > Hi All,
> > In this thread they proposed fix_concurrent_slot_xmin_update.patch will
> solve this assert failure. After applying this patch I execute
> pg_sync_replication_slots() (which calls SyncReplicationSlots →
> synchronize_slots() → synchronize_one_slot() →
> ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in
> ReplicationSlotsComputeRequiredXmin() because the
> ReplicationSlotControlLock is not held in that code path. By default
> sync_replication_slots is off, so the background slot-sync worker is not
> spawned; invoking the UDF directly exercises the path without the lock. I
> have a small patch that acquires ReplicationSlotControlLock in the manual
> sync path; that stops the assert.
> >
> > Call Stack :
> > TRAP: failed Assert("!already_locked ||
> (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
> LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.
> > c", Line: 1061, PID: 67056
> > 0   postgres                            0x000000010104aad4
> ExceptionalCondition + 216
> > 1   postgres                            0x0000000100d8718c
> ReplicationSlotsComputeRequiredXmin + 180
> > 2   postgres                            0x0000000100d6fba8
> synchronize_one_slot + 1488
> > 3   postgres                            0x0000000100d6e8cc
> synchronize_slots + 1480
> > 4   postgres                            0x0000000100d6efe4
> SyncReplicationSlots + 164
> > 5   postgres                            0x0000000100d8da84
> pg_sync_replication_slots + 476
> > 6   postgres                            0x0000000100b34c58
> ExecInterpExpr + 2388
> > 7   postgres                            0x0000000100b33ee8
> ExecInterpExprStillValid + 76
> > 8   postgres                            0x00000001008acd5c
> ExecEvalExprSwitchContext + 64
> > 9   postgres                            0x0000000100b54d48 ExecProject +
> 76
> > 10  postgres                            0x0000000100b925d4 ExecResult +
> 312
> > 11  postgres                            0x0000000100b5083c
> ExecProcNodeFirst + 92
> > 12  postgres                            0x0000000100b48b88 ExecProcNode
> + 60
> > 13  postgres                            0x0000000100b44410 ExecutePlan +
> 184
> > 14  postgres                            0x0000000100b442dc
> standard_ExecutorRun + 644
> > 15  postgres                            0x0000000100b44048 ExecutorRun +
> 104
> > 16  postgres                            0x0000000100e3053c
> PortalRunSelect + 308
> > 17  postgres                            0x0000000100e2ff40 PortalRun +
> 736
> > 18  postgres                            0x0000000100e2b21c
> exec_simple_query + 1368
> > 19  postgres                            0x0000000100e2a42c PostgresMain
> + 2508
> > 20  postgres                            0x0000000100e22ce4
> BackendInitialize + 0
> > 21  postgres                            0x0000000100d1fd4c
> postmaster_child_launch + 304
> > 22  postgres                            0x0000000100d26d9c
> BackendStartup + 448
> > 23  postgres                            0x0000000100d23f18 ServerLoop +
> 372
> > 24  postgres                            0x0000000100d22f18
> PostmasterMain + 6396
> > 25  postgres                            0x0000000100bcffd4 init_locale +
> 0
> > 26  dyld                                0x0000000186d82b98 start + 6076
> >
> > The assert is raised inside ReplicationSlotsComputeRequiredXmin()
> because that function expects either that already_locked is false (and it
> will acquire what it needs), or that callers already hold both
> ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In
> the manual-sync path called by the UDF, neither lock is held, so the
> assertion trips.
> >
> > Why this happens:
> > The background slot sync worker (spawned when sync_replication_slots =
> on) acquires the necessary locks before calling the routines that
> update/compute slot xmins, so the worker path is safe.The manual path
> through the SQL-callable UDF does not take the same locks before calling
> synchronize_slots()/synchronize_one_slot(). As a result the invariant
> assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading
> to the assert.
> >
> > Proposed fix:
> > In synchronize_slots() (the code path used by
> SyncReplicationSlots()/pg_sync_replication_slots()), acquire
> ReplicationSlotControlLock before any call that can end up calling
> ReplicationSlotsComputeRequiredXmin(true).
>
> It would be great if we have a test case for this issue possibly using
> injection points.
>
> Also, I think it's worth considering the idea Robert shared before[1]:
>
> ---
> But what about just surgically preventing that?
> ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
> perhaps? If it computes an older value than what's there, it just does
> nothing?
> ---
>
> We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582,
> and it sounds reasonable to me that ProcArraySetReplicationSlotXmin()
> refuses to retreat the values.
>
> Regards,
>
> [1]
> https://www.postgresql.org/message-id/CA%2BTgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>
| Attachment | Content-Type | Size | 
|---|---|---|
| fix_concurrent_slot_xmin_update_version2.patch | application/octet-stream | 4.8 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Shinya Kato | 2025-10-30 07:38:30 | Re: Add mode column to pg_stat_progress_vacuum | 
| Previous Message | Michael Paquier | 2025-10-30 06:57:13 | Re: Add wal_fpi_bytes_[un]compressed to pg_stat_wal |