| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
| Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Subject: | Re: Refactor recovery conflict signaling a little |
| Date: | 2026-03-04 05:33:21 |
| Message-ID: | CABPTF7X5+fQ6kOesnk1Q7g3cgz858CYMkbk-oUNGzDS1bb6F4Q@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Alexander,
On Wed, Mar 4, 2026 at 3:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hello Heikki,
>
> 03.03.2026 17:39, Heikki Linnakangas wrote:
> > On 24/02/2026 10:00, Alexander Lakhin wrote:
> >> The "terminating process ..." message doesn't appear when the test passes
> >> successfully.
> >
> > Hmm, right, it looks like something is wrong in the recovery conflict signaling. I can't tell whether the signal
> > is not being sent, or whether it's sent but not processed correctly. Looking at the code, I don't see anything wrong.
> >
> >> I've managed to reproduce this with multiple (20) test instances running
> >> in a loop (it failed within 10 iterations for me); `git bisect` for this
> >> anomaly pointed at 17f51ea81.
> >
> > I've been trying to reproduce this locally, but so far no success, after thousands of iterations.
> >
> > If you can still reproduce this, can you try it with code changes from the attached
> > recovery-conflict-fail-extra-logging.patch, which adds some extra logging, and send over the logs please?
> >
> > The recovery-conflict-fail-repro-attempt.patch contains very hacky changes to the test, to run just the failing
> > part 100 times in a loop. That's just to show what I used to try to reproduce this, but no luck.
>
> Thank you for your attention to it!
>
> I can easily reproduce the failure with the attached script when using an SSD
> (the test didn't fail for me on tmpfs or HDD; probably some tuning is needed):
> ITERATION 1
> ...
> 19 t/035_standby_logical_decoding.pl .. ok
> 19 All tests successful.
> 19 Files=1, Tests=28, 20 wallclock secs ( 0.02 usr 0.01 sys + 0.35 cusr 0.80 csys = 1.18 CPU)
> 19 Result: PASS
> 2 # poll_query_until timed out executing this query:
> 2 # SELECT '0/0403F950' <= replay_lsn AND state = 'streaming'
> 2 # FROM pg_catalog.pg_stat_replication
> 2 # WHERE application_name IN ('standby', 'walreceiver')
> 2 # expecting this output:
> 2 # t
> 2 # last actual query output:
> 2 # f
> 2 # with stderr:
> ...
>
> 035_standby_logical_decoding_standby.log contains:
> 2026-03-03 20:34:14.198 EET startup[844699] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
> 2026-03-03 20:34:14.198 EET startup[844699] DETAIL: The slot conflicted with xid horizon 748.
> 2026-03-03 20:34:14.198 EET startup[844699] CONTEXT: WAL redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS:
> snapshotConflictHorizon: 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1, dead: [33, 34, 35],
> unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-03-03 20:34:14.198 EET startup[844699] LOG: terminating process 845647 to release replication slot
> "row_removal_activeslot"
> 2026-03-03 20:34:14.198 EET startup[844699] DETAIL: The slot conflicted with xid horizon 748.
> 2026-03-03 20:34:14.198 EET startup[844699] CONTEXT: WAL redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS:
> snapshotConflictHorizon: 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1, dead: [33, 34, 35],
> unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-03-03 20:34:14.198 EET startup[844699] LOG: XXX: SendProcSignal sending SIGUSR1 to pid 845647
> 2026-03-03 20:34:14.198 EET startup[844699] CONTEXT: WAL redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS:
> snapshotConflictHorizon: 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1, dead: [33, 34, 35],
> unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-03-03 20:37:26.827 EET walreceiver[844817] FATAL: could not receive data from WAL stream: server closed the
> connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> The full logs are attached too. I can add any other logging you want.
>
> Best regards,
> Alexander
I was unable to reproduce the issue on an x86_64 Linux machine using
the provided script. All test runs completed successfully without any
failures.
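
For anyone else trying to reproduce this, the loop I ran amounts to the following minimal sketch. This is not Alexander's attached script; the test invocation and paths are assumptions (a typical in-tree `make check` of the single TAP test), and `RUN_TEST` here is a placeholder so the skeleton runs standalone:

```shell
#!/bin/sh
# Iterate-until-failure skeleton for stressing a flaky TAP test.
# RUN_TEST is a placeholder; in a real PostgreSQL source tree it would be
# something like:
#   make -C src/test/recovery check PROVE_TESTS=t/035_standby_logical_decoding.pl
RUN_TEST='true'   # placeholder command standing in for the actual test run

i=1
while [ "$i" -le 5 ]; do
    echo "ITERATION $i"
    if ! sh -c "$RUN_TEST"; then
        # Stop at the first failure so the logs of that iteration survive.
        echo "failed at iteration $i"
        exit 1
    fi
    i=$((i + 1))
done
echo "all iterations passed"
```

With the placeholder replaced by the real test command and the iteration count raised (Alexander saw a failure within 10 iterations on SSD), this should exercise the same timing window.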
--
Best,
Xuneng