Re: XLogReadRecord() error in XlogReadTwoPhaseData()

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgbf(at)twiska(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Date: 2022-01-16 07:12:10
Message-ID: 20220116071210.GA735692@rfd.leadboat.com
Lists: pgsql-hackers

On Fri, Nov 19, 2021 at 09:18:23PM -0800, Noah Misch wrote:
> On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
> > On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> > > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > > Each of the three failures happened on a sparc64 Debian+gcc machine. I had
> > > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > > > without reproducing this.
> >
> > > # 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
> > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
> >
> > Two others:
> > ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
> > -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
> > ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020
> > -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52

> > I don't have a great theory, but here are candidates to examine next:
> >
> > - Run with wal_debug=on to confirm logged write location matches read location.
> > - Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> > pg_waldump -s 0/01000000" at the end of the test.
> > - Dump WAL page binary image at the point of failure.
> > - Log which branches in XLogReadRecord() are taken.
>
> Tom Turelinckx, are you able to provide remote access to kittiwake or
> tadarida? I'd use it to attempt the above things. All else being equal,
> kittiwake is more relevant since it's still supported upstream.

Thanks for setting up access. Summary: this kernel has a bug in I/O syscalls.
How practical is it to update that kernel? (Userland differs across these
animals, but the kernel does not.) The kernel on buildfarm member thorntail
doesn't exhibit the bug.

For specifics of the kernel bug, see the attached test program. In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same file. POSIX
says the reader should see either the data as it existed before the write or
the newly-written data. On this kernel, the reader can see zeros instead. That
leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure. We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.
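
For readers without the attachment, the access pattern described above can be
sketched roughly as follows. This is not the attached io-rectitude.c, only a
minimal illustration under stated assumptions: the function names, the 8192-byte
block size (chosen to match the default WAL block size), and the fill byte 0x5A
are all hypothetical. A child process keeps rewriting one block with identical
nonzero bytes while the parent rereads it; since the old and new contents are
the same, any zero byte the reader sees cannot be explained by POSIX semantics.

```c
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCK_SIZE 8192     /* assumption: default WAL block size */

/* Return 1 if buf contains at least one zero byte, else 0. */
static int
has_zero_byte(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}

/*
 * Rewrite one block with unchanging nonzero bytes in a child process while
 * the parent rereads it.  Returns the number of reads that were short or
 * contained a zero byte, or -1 if setup failed.
 */
long
count_zeroed_reads(const char *path, long reads)
{
    unsigned char block[BLOCK_SIZE];
    unsigned char seen[BLOCK_SIZE];
    long        bad = 0;
    pid_t       writer;
    int         fd;

    memset(block, 0x5A, sizeof(block));
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* Populate the block before any reader looks at it. */
    if (pwrite(fd, block, sizeof(block), 0) != (ssize_t) sizeof(block))
    {
        close(fd);
        return -1;
    }

    writer = fork();
    if (writer < 0)
    {
        close(fd);
        return -1;
    }
    if (writer == 0)
    {
        /* Child: rewrite the same bytes until the parent kills us. */
        for (;;)
            if (pwrite(fd, block, sizeof(block), 0) < 0)
                _exit(1);
    }

    /* Parent: every read should return the full, nonzero block. */
    for (long i = 0; i < reads; i++)
    {
        ssize_t     n = pread(fd, seen, sizeof(seen), 0);

        if (n != (ssize_t) sizeof(seen) || has_zero_byte(seen, (size_t) n))
            bad++;
    }

    kill(writer, SIGKILL);
    waitpid(writer, NULL, 0);
    close(fd);
    unlink(path);
    return bad;
}
```

Calling count_zeroed_reads("/tmp/some-scratch-file", 100000) should return 0 on
a kernel that behaves as POSIX describes. A nonzero result would be consistent
with the symptom above, though a short run returning 0 does not prove the
kernel is free of the bug.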

Separately, while I don't know of any relevance to PostgreSQL, I was interested
to see that CentOS 7 pwrite()/pread() lack the POSIX-required atomicity.

> The setup of your buildfarm animals provides a clue. I understand kittiwake
> is the same as ibisbill except for build options, and tadarida is the same as
> mussurana except for build options. ibisbill and mussurana haven't failed, so
> one of these is likely needed to reproduce:
>
> absence of --enable-cassert
> CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
> CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
> LDFLAGS='-Wl,-z,relro -Wl,-z,now'

That was a red herring. ibisbill and mussurana don't use --with-tap-tests.
Adding --with-tap-tests has been enough to make their configurations witness
the same kinds of failures.

nm

Attachment: io-rectitude.c (text/plain, 5.1 KB)
