| From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
|---|---|
| To: | Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | RE: Assertion failure in SnapBuildInitialSnapshot() |
| Date: | 2025-11-13 04:56:01 |
| Message-ID: | TY4PR01MB1690784F8EFFDC149FE67FAF594CDA@TY4PR01MB16907.jpnprd01.prod.outlook.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wednesday, November 12, 2025 7:27 PM Pradeep Kumar <spradeepkumar29(at)gmail(dot)com> wote:
> I've been investigating the assert failure in
> ProcArraySetReplicationSlotXmin() and would like to share my approach and get
> feedback. Instead of inverting the locks and what robert shared before [1].
> Instead of unconditionally updating procArray->replication_slot_xmin in
> ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates
> conditional:
> 1) Only update if the incoming xmin is valid
> 2) Only update if it's older than the currently stored xmin
> 3) Do the same for procArray->replication_slot_catalog_xmin
...
> In above block of code ensures we always track the minimum xmin across all
> active replication slots without losing data. And also no need to worry about
> locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot()
> while we computing safexid by calling GetOldestSafeDecodingTransactionId(false)
> will enters into first case and update the oldestSafeXid =
> procArray->replication_slot_xmin. So it won't return nextXid. And also it solves
> this issue [2].
Thanks for evaluating new approach, but I think this approach could not work
because we expect replication_slot_xmin to be set to an invalid number when the
last slot is dropped, while this approach would disallow that, causing WALs to
be retained. For a detailed explanation, please refer to [1].
While testing the patches across all branches, I noticed that an additional lock
needs to be added in the launcher.c where
ReplicationSlotsComputeRequiredXmin(true) was recently added for conflict
detection slot. I have modified the original patch accordingly.
BTW, I am not adding a test using an injection point because it does not seem
practical to insert an injection point inner
ReplicationSlotsComputeRequiredXmin. The reason is that the injection point
function internally calls CHECK_FOR_INTERRUPTS(), but the key functions in the
patch holds the lwlock, holding holds interrupts.
I am sharing the patches for all branches for reference.
Best Regards,
Hou zj
| Attachment | Content-Type | Size |
|---|---|---|
| v3HEAD-0001-Fix-a-race-condition-of-updating-procArray-replic.patch | application/octet-stream | 7.3 KB |
| v3PG18-17-0001-Fix-a-race-condition-of-updating-procArray-re.patch | application/octet-stream | 6.5 KB |
| v3PG16-13-0001-Fix-a-race-condition-of-updating-procArray-re.patch | application/octet-stream | 5.6 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chao Li | 2025-11-13 05:14:37 | Re: Suggestion to add --continue-client-on-abort option to pgbench |
| Previous Message | Shinya Kato | 2025-11-13 04:49:40 | Re: Add mode column to pg_stat_progress_vacuum |