Re: Refactor recovery conflict signaling a little

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Subject: Re: Refactor recovery conflict signaling a little
Date: 2026-03-09 06:59:31
Message-ID: CABPTF7X0j91LFh_Rk3V01rhGURSRp27M0Ts3YGe2vYPFHaijFg@mail.gmail.com
Lists: pgsql-hackers

Hi Alexander,

On Sat, Mar 7, 2026 at 7:00 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hello Xuneng and Heikki,
>
> 04.03.2026 07:33, Xuneng Zhou wrote:
>
> 03.03.2026 17:39, Heikki Linnakangas wrote:
>
> On 24/02/2026 10:00, Alexander Lakhin wrote:
>
> The "terminating process ..." message doesn't appear when the test passes
> successfully.
>
> Hmm, right, looks like something wrong in signaling the recovery conflict. I can't tell if the signal is being sent,
> or it's not processed correctly. Looking at the code, I don't see anything wrong.
>
> I was unable to reproduce the issue on an x86_64 Linux machine using
> the provided script. All test runs completed successfully without any
> failures.
>
>
> I've added debug logging (see attached) and saw the following:
> !!!SignalRecoveryConflict[282363]
> !!!ProcArrayEndTransaction| pendingRecoveryConflicts = 0
> !!!ProcessInterrupts[283863]| MyProc->pendingRecoveryConflicts: 0
> !!!ProcessInterrupts[283863]| MyProc->pendingRecoveryConflicts: 0
> 2026-03-07 12:21:24.544 EET walreceiver[282421] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> 2026-03-07 12:21:24.645 EET postmaster[282355] LOG: received immediate shutdown request
> 2026-03-07 12:21:24.647 EET postmaster[282355] LOG: database system is shut down
>
> While for a successful run, I see:
> 2026-03-07 12:18:17.075 EET startup[285260] DETAIL: The slot conflicted with xid horizon 677.
> 2026-03-07 12:18:17.075 EET startup[285260] CONTEXT: WAL redo at 0/04022130 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 677, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 2, nunused: 0, dead: [35, 36]; blkref #0: rel 1663/16384/16418, blk 10
> !!!SignalRecoveryConflict[285260]
> !!!ProcessInterrupts[286071]| MyProc->pendingRecoveryConflicts: 16
> !!!ProcessRecoveryConflictInterrupts[286071]
> !!!ProcessRecoveryConflictInterrupts[286071] pending: 16, reason: 4
> 2026-03-07 12:18:17.075 EET walsender[286071] 035_standby_logical_decoding.pl ERROR: canceling statement due to conflict with recovery
> 2026-03-07 12:18:17.075 EET walsender[286071] 035_standby_logical_decoding.pl DETAIL: User was using a logical replication slot that must be invalidated.
>
> (Full logs for this failed run and a good run are attached.)
>

Thanks again for the logs. I think we might frame the next run around
three explicit hypotheses:

1) Self-clear: the target walsender clears its own
pendingRecoveryConflicts during xact cleanup before
ProcessInterrupts() consumes it.
2) Group-clear-on-behalf: another backend clears the target PGPROC via
group clear before consume.
3) Visibility gap: the weak pg_atomic_read_u32() in the interrupt path
observes 0 even though the conflict bit was set.

The bad vs good traces are consistent with a receiver-side handoff failure:
- bad run: signal path reached, target walsender later sees
pendingRecoveryConflicts = 0, never dispatches conflict;
- good run: target sees pending = 0x10, dispatches reason 4
(RECOVERY_CONFLICT_LOGICALSLOT), throws expected ERROR, releases slot.

I prepared an updated diagnostics patch to classify hypotheses (1) and
(2) directly:
- SignalRecoveryConflict: logs target pid/procno and old/new mask
- clear sites: logs self_pid, target_pid, target_procno, self_is_target
- ProcessInterrupts: logs handler counter + pending-mask observations

Two caveats:
- The weak/strong telemetry is supportive but not definitive for
hypothesis (3); it may miss the exact stale-read window.
- The patch preserves the original dispatch path, but added
diagnostics (especially barriered reads in clear paths) can perturb
timing, so repro rate may shift.

What to look for:
Hypothesis (1) self-clear:
a ProcArrayEndTransaction (or ProcArrayClearTransaction, self case) log
line with self_is_target=t clearing pendingRecoveryConflicts=0x10 for
the target pid/procno, before the target's ProcessInterrupts reports
that the handler fired but pending=0.

Hypothesis (2) group-clear-on-behalf:
a ProcArrayEndTransactionInternal (or ProcArrayClearTransaction) log
line with self_is_target=f and target_procno=<walsender procno>
clearing pendingRecoveryConflicts=0x10.

Hypothesis (3) visibility gap (suggestive only):
no matching clear-site log for the target procno, yet the target still
reports that the handler fired with pending=0.

If this run is inconclusive, I suggest a direct behavioral A/B as the next step:
- change only the ProcessInterrupts() recovery-conflict check from
pg_atomic_read_u32() to pg_atomic_read_membarrier_u32();
- if failures disappear, that strongly supports hypothesis (3).

--
Best,
Xuneng

Attachment Content-Type Size
035_debugging.patch application/x-patch 7.6 KB
