Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Date: 2023-08-10 10:55:24
Message-ID: CA+hUKGJs8mskHt=38dFQYkucv0H44xTy=EDF0=D0sGuJms3DBw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 10, 2023 at 9:15 PM Christoph Berg <myon(at)debian(dot)org> wrote:
> No XXX lines this time either, but I've seen them in logfiles that
> went through successfully.

Hmm. Well, I think this looks like a different kind of bug then.
That patch of mine is about fixing some unsafe coding on the receiving
side of a signal. In this case the signal is apparently not being sent. So
either the Heap2/PRUNE record was able to proceed (indicating that
that CURSOR was not holding a pin as expected), or VACUUM decided not
to actually do anything to that block (conditional cleanup lock vs
transient pin changing behaviour?), or there's a bug somewhere in/near
LockBufferForCleanup(), which should have emitted that XXX message
before even calling ResolveRecoveryConflictWithBufferPin().

Do you still have the data directories around from that run, so we can
see if the expected Heap2/PRUNE was actually logged? For example
(using meson layout here, in the build directory) that'd be something
like:

$ ./tmp_install/home/tmunro/install/bin/pg_waldump \
    testrun/recovery/031_recovery_conflict/data/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003

In there I see this:

rmgr: Heap2 len (rec/tot): 57/ 57, tx: 0, lsn: 0/0344BB90, prev 0/0344BB68, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [21], unused: [], blkref #0: rel 1663/16385/16386 blk 0
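
If the WAL segment is large, something like the following could pick out the Heap2/PRUNE records instead of reading the whole dump by eye. It is only a rough sketch: find_prune_records and the two regexes are names made up for this example, and the patterns assume the one-record-per-line layout of the sample above, which may differ in other versions.

```python
import re

# Patterns are based only on the sample record shown in this message;
# treat them as an assumption about the pg_waldump text layout.
RECORD_RE = re.compile(
    r"rmgr:\s*(?P<rmgr>\S+).*?"
    r"lsn:\s*(?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+).*?"
    r"desc:\s*(?P<op>\S+)"
)
BLKREF_RE = re.compile(r"rel\s+(?P<rel>\d+/\d+/\d+),?\s+blk\s+(?P<blk>\d+)")


def find_prune_records(lines):
    """Yield (lsn, rel, blk) for each Heap2/PRUNE record.

    `lines` is an iterable of pg_waldump output lines, one record per line.
    rel and blk are None if the record carries no block reference.
    """
    for line in lines:
        m = RECORD_RE.search(line)
        if not m or m.group("rmgr") != "Heap2" or m.group("op") != "PRUNE":
            continue
        ref = BLKREF_RE.search(line)
        rel = ref.group("rel") if ref else None
        blk = int(ref.group("blk")) if ref else None
        yield (m.group("lsn"), rel, blk)
```

Feeding it the captured output of the pg_waldump command above, split into lines, should list every PRUNE record together with the relation and block it touches. An empty result would suggest that VACUUM never pruned that block at all.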

That's the WAL record that's supposed to be causing
031_recovery_conflict_standby.log to talk about a conflict, starting
with this:

2023-08-10 22:47:04.564 NZST [57145] LOG: recovery still waiting after 10.035 ms: recovery conflict on buffer pin
2023-08-10 22:47:04.564 NZST [57145] CONTEXT: WAL redo at 0/344BB90 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [21], unused: []; blkref #0: rel 1663/16385/16386, blk 0
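
Note that pg_waldump prints the LSN as 0/0344BB90 while the server log prints 0/344BB90, so a plain string search for one will miss the other. A hedged sketch of a check that compares them numerically follows. lsn_to_int and conflict_logged_for are hypothetical helper names, and the matching assumes the LOG and CONTEXT lines look like the excerpt above, one message per line.

```python
import re

# Assumed shape of the CONTEXT line, taken from the log excerpt above.
REDO_RE = re.compile(r"WAL redo at\s+([0-9A-Fa-f]+/[0-9A-Fa-f]+)")


def lsn_to_int(lsn):
    """Convert 'hi/lo' hex LSN text to an integer, ignoring zero padding."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def conflict_logged_for(log_lines, lsn):
    """Return True if a buffer-pin conflict wait was logged for the record at lsn.

    Looks for a 'recovery conflict on buffer pin' message followed by a
    CONTEXT line whose 'WAL redo at' LSN equals the given one.
    """
    target = lsn_to_int(lsn)
    pending = False
    for line in log_lines:
        if "recovery conflict on buffer pin" in line:
            pending = True
        elif pending and "CONTEXT:" in line:
            m = REDO_RE.search(line)
            if m and lsn_to_int(m.group(1)) == target:
                return True
            pending = False
    return False
```

If this returns False for the LSN of a PRUNE record that pg_waldump does show, then replay of that record apparently never had to wait for the pin. That would point at the first or third possibility above rather than at VACUUM skipping the block.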
