Re: XLogReadRecord() error in XlogReadTwoPhaseData()

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgbf(at)twiska(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Date: 2021-11-20 05:18:23
Message-ID: 20211120051823.GA1367142@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
> On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > Each of the three failures happened on a sparc64 Debian+gcc machine. I had
> > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > > without reproducing this.
>
> > # 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
>
> Two others:
> ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
> ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52
>
> > I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
> > that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
> > written out yet. Have we got any synchronization around that?
>
> If the WAL address isn't on disk, that error doesn't happen. Instead,
> read_local_xlog_page() blocks waiting for the WAL to become available. It's
> still possible that we make the WAL region exist, but it somehow doesn't
> contain the right data until shortly later. FinishPreparedTransaction() takes
> TwoPhaseStateLock and looks for an entry having gxact->valid. EndPrepare()
> fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
> under TwoPhaseStateLock. All freelist (freeGXacts) interaction holds
> TwoPhaseStateLock. I'm not seeing a gap in that synchronization.
>
> I don't have a great theory, but here are candidates to examine next:
>
> - Run with wal_debug=on to confirm logged write location matches read location.
> - Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> pg_waldump -s 0/01000000" at the end of the test.
> - Dump WAL page binary image at the point of failure.
> - Log which branches in XLogReadRecord() are taken.

Tom Turelinckx, are you able to provide remote access to kittiwake or
tadarida? I'd use it to attempt the above things. All else being equal,
kittiwake is more relevant since it's still supported upstream.

The setup of your buildfarm animals provides a clue. I understand kittiwake
is the same as ibisbill except for build options, and tadarida is the same as
mussurana except for build options. ibisbill and mussurana haven't failed, so
one of these is likely needed to reproduce:

absence of --enable-cassert
CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
LDFLAGS='-Wl,-z,relro -Wl,-z,now'

However, I can't reproduce this on thorntail, even if I use tadarida's
./configure options, CFLAGS, CPPFLAGS, and LDFLAGS.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-11-20 05:22:28 Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Previous Message Bharath Rupireddy 2021-11-20 04:33:27 Re: pg_waldump stucks with options --follow or -f and --stats or -z