Re: WIP: WAL prefetch (another approach)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-04-22 01:21:05
Message-ID: 892344.1619054465@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> ... Personally I think the right thing to do now is to revert it
>> and re-propose for 15 early in the cycle, supported with some better
>> testing infrastructure.

> I tend to agree with the idea to revert it, perhaps a +0 on that, but if
> others argue it should be fixed in-place, I wouldn’t complain about it.

FWIW, I've so far only been able to see problems on two old PPC Macs,
one of which has been known to be a bit flaky in the past. So it's
possible that what I'm looking at is a hardware glitch. But it's
consistent enough that I rather doubt that.

What I'm doing is running the core regression tests with a single
standby (on the same machine) and wal_consistency_checking = all.
Fairly reproducibly (more than one run in ten), what I get on the
slightly-flaky machine is consistency check failures like

2021-04-21 17:42:56.324 EDT [42286] PANIC: inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 0x03
2021-04-21 17:42:56.324 EDT [42286] CONTEXT: WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW
2021-04-21 17:45:11.662 EDT [42284] LOG: startup process (PID 42286) was terminated by signal 6: Abort trap

2021-04-21 11:25:30.091 EDT [38891] PANIC: inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 0x99
2021-04-21 11:25:30.091 EDT [38891] CONTEXT: WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW
2021-04-21 11:26:59.371 EDT [38889] LOG: startup process (PID 38891) was terminated by signal 6: Abort trap

2021-04-20 19:20:16.114 EDT [34405] PANIC: inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 0x03
2021-04-20 19:20:16.114 EDT [34405] CONTEXT: WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW
2021-04-20 19:21:54.421 EDT [34403] LOG: startup process (PID 34405) was terminated by signal 6: Abort trap

2021-04-20 17:44:09.356 EDT [24106] FATAL: inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 0x00
2021-04-20 17:44:09.356 EDT [24106] CONTEXT: WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW

(Note that I modified checkXLogConsistency to PANIC on failure, so I could
get a core dump to analyze; it also reports the location of the first
mismatched byte.)
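
That hack is nothing fancy; in rough outline it looks about like this (a
sketch rather than the exact diff, but the names are the ones
checkXLogConsistency already uses in xlog.c):

if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
{
    int     i;

    /* find the first mismatching byte, then die hard enough to get a core */
    for (i = 0; i < BLCKSZ; i++)
    {
        if (replay_image_masked[i] != primary_image_masked[i])
            break;
    }
    elog(PANIC,
         "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u, "
         "byte offset %d: replay 0x%02x primary 0x%02x",
         rnode.spcNode, rnode.dbNode, rnode.relNode, forknum, blkno,
         i,
         (unsigned char) replay_image_masked[i],
         (unsigned char) primary_image_masked[i]);
}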

I have not analyzed each one of these failures exhaustively, but on the
ones I have looked at closely, the replay_image_masked version of the page
appears correct while the primary_image_masked version is *not*.
Moreover, the primary_image_masked version does not match the full-page
image that I see in the on-disk WAL file. It did, however, seem to match
the in-memory WAL record contents that the decoder is operating on.
So unless you want to believe the buggy-hardware theory, something's
occasionally messing up while loading WAL records from disk. All of the
trouble cases involve records that span across WAL pages (unsurprising
since they contain FPIs), so maybe there's something not quite right
in there.
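
(To put a number on "span across WAL pages": with the default XLOG_BLCKSZ of
8192, a record continues onto the next page whenever its starting offset
within its page plus its total length exceeds the page size, and a record
carrying an 8kB full-page image essentially always does. Rough arithmetic
only, ignoring the continuation page's header:)

#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192        /* default WAL page size */

/* does a record starting at "lsn" with total length "len" cross pages? */
static bool
record_spans_pages(uint64_t lsn, uint32_t len)
{
    return (lsn % XLOG_BLCKSZ) + len > XLOG_BLCKSZ;
}

For instance, the Heap/INSERT above starts at 24/121C97B0, i.e. 6064 bytes
into its WAL page, leaving only about 2kB before the page boundary.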

In the cases that I looked at closely, it appeared that there was a
block of 32 wrong bytes somewhere within the page image, with the data
before and after that being correct. I'm not sure if that pattern
holds in all cases though.
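
(Spotting that pattern is just a matter of walking the two masked images
side by side and printing contiguous runs of differing bytes. A throwaway
program along these lines does it, assuming you've dumped the two 8kB images
out of the core into files; it's only a sketch, nothing from the tree:)

#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192

static void
read_image(const char *path, unsigned char *buf)
{
    FILE   *f = fopen(path, "rb");

    if (f == NULL || fread(buf, 1, BLCKSZ, f) != BLCKSZ)
    {
        fprintf(stderr, "could not read %s\n", path);
        exit(1);
    }
    fclose(f);
}

int
main(int argc, char **argv)
{
    unsigned char replay[BLCKSZ];
    unsigned char primary[BLCKSZ];
    int     i = 0;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s replay_image_file primary_image_file\n",
                argv[0]);
        exit(1);
    }
    read_image(argv[1], replay);
    read_image(argv[2], primary);

    /* report each contiguous run of mismatched bytes as offset + length */
    while (i < BLCKSZ)
    {
        if (replay[i] != primary[i])
        {
            int     start = i;

            while (i < BLCKSZ && replay[i] != primary[i])
                i++;
            printf("mismatch at offset %d, length %d\n", start, i - start);
        }
        else
            i++;
    }
    return 0;
}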

BTW, if I restart the failed standby, it plows through the same data
just fine, confirming that the on-disk WAL is not corrupt.

The other PPC machine (with no known history of trouble) is the one
that had the CRC failure I showed earlier. That one does seem to involve
actual bad data in the stored WAL, because the problem was also seen by
pg_waldump, and trying to restart the standby got the same failure
again. I've not been able to duplicate the consistency-check failures
there. But because that machine is a laptop with a much inferior disk
drive, the speeds are different enough that it's not really surprising
if it doesn't hit the same problem.

I've also tried to reproduce this on 32-bit and 64-bit Intel, without
success. So if this is real, maybe it's related to the hardware being
big-endian? But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

regards, tom lane
