Re: Refactor recovery conflict signaling a little

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Subject: Re: Refactor recovery conflict signaling a little
Date: 2026-03-03 15:39:27
Message-ID: 1f354762-c6a8-46a5-9626-efeb45f5f112@iki.fi
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24/02/2026 10:00, Alexander Lakhin wrote:
> Could you please look at a new failure of 035_standby_logical_decoding.pl
> produced at buildfarm [1]?:
> [05:34:51.345](233.134s) # poll_query_until timed out executing this query:
> # SELECT '0/0403F950' <= replay_lsn AND state = 'streaming'
> #          FROM pg_catalog.pg_stat_replication
> #          WHERE application_name IN ('standby', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
>
> 035_standby_logical_decoding_standby.log contains:
> 2026-02-24 05:30:58.300 CET [1512377][client backend][:0] LOG:
> disconnection: session time: 0:00:00.035 user=bf database=testdb
> host=[local]
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] LOG: invalidating
> obsolete replication slot "row_removal_inactiveslot"
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] DETAIL:  The slot
> conflicted with xid horizon 748.
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] CONTEXT:  WAL
> redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon:
> 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1,
> dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] LOG: terminating
> process 1512360 to release replication slot "row_removal_activeslot"
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] DETAIL:  The slot
> conflicted with xid horizon 748.
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] CONTEXT:  WAL
> redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon:
> 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1,
> dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-02-24 05:34:51.400 CET [1508227][walreceiver][:0] FATAL:  could not
> receive data from WAL stream: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>
> The "terminating process ..." message doesn't appear when the test passes
> successfully.

Hmm, right, looks like something wrong in signaling the recovery
conflict. I can't tell if the signal is being sent, or it's not
processed correctly. Looking at the code, I don't see anything wrong.

> I've managed to reproduce this with multiple (20) test instances running
> in a loop (it failed within 10 iterations for me); `git bisect` for this
> anomaly pointed at 17f51ea81.

I've been trying to reproduce this locally, but so far not success,
after thousands of iterations.

If you can still reproduce this, can you try it with code changes from
the attached recovery-conflict-fail-extra-logging.patch, which adds some
extra logging, and send over the logs please?

The recovery-conflict-fail-repro-attempt.patch contains very hacky
changes to the test, to run the just the failing part 100 times in a
loop. That's just to show what I used to try to reproduce this, but no luck.

- Heikki

Attachment Content-Type Size
recovery-conflict-fail-extra-logging.patch text/x-patch 2.7 KB
recovery-conflict-fail-repro-attempt.patch text/x-patch 24.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shlok Kyal 2026-03-03 15:46:44 Re: Skipping schema changes in publication
Previous Message Tom Lane 2026-03-03 15:31:06 Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq