Re: WIP: WAL prefetch (another approach)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-05-02 03:16:07
Message-ID: 3502526.1619925367@sss.pgh.pa.us
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Thu, Apr 29, 2021 at 4:45 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andres Freund <andres(at)anarazel(dot)de> writes:
>>> Tom, any chance you could check if your machine repros the issue before
>>> these commits?

>> Wilco, but it'll likely take a little while to get results ...

> FWIW I also chewed through many megawatts trying to reproduce this on
> a PowerPC system in 64 bit big endian mode, with an emulator. No
> cigar. However, it's so slow that I didn't make it to 10 runs...

So I've expended a lot of kilowatt-hours over the past several days,
and I've got results that are interesting but don't really get us
any closer to a resolution.

To recap, the test lashup is:
* 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
* Standard debug build (--enable-debug --enable-cassert)
* Out-of-the-box configuration, except add wal_consistency_checking = all
and configure a wal-streaming standby on the same machine
* Repeatedly run "make installcheck-parallel", but skip the tablespace
test to avoid issues with the standby trying to use the same directory
* Delay long enough after each installcheck-parallel to let the
standby catch up (the run proper is ~24 min, plus 2 min for catchup)
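
For concreteness, here's roughly what that driver loop looks like as a
Python sketch.  The ports, database name, and the way the tablespace test
gets skipped are placeholders/assumptions, not the exact setup on this
machine:

#!/usr/bin/env python3
# Rough driver for the loop described above: run installcheck-parallel
# repeatedly and wait for the standby to replay everything before starting
# the next cycle.  Ports and database name are assumptions.
import subprocess
import time

PRIMARY_PORT = 5432   # assumed primary port
STANDBY_PORT = 5433   # assumed standby port
CYCLES = 100

def psql(port, query):
    # Run a single query through psql and return the bare result.
    out = subprocess.run(
        ["psql", "-p", str(port), "-At", "-c", query, "postgres"],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

def standby_caught_up(target_lsn):
    # True once the standby has replayed up to the primary's WAL position
    # captured at the end of the installcheck run.
    return psql(STANDBY_PORT,
                "SELECT pg_last_wal_replay_lsn() >= '%s'::pg_lsn"
                % target_lsn) == "t"

for cycle in range(CYCLES):
    # Skipping the tablespace test is assumed to be done by removing it from
    # src/test/regress/parallel_schedule beforehand; that step isn't shown.
    subprocess.run(["make", "installcheck-parallel"], check=True)
    target = psql(PRIMARY_PORT, "SELECT pg_current_wal_lsn()")
    while not standby_caught_up(target):
        time.sleep(10)
    print("cycle %d complete, standby caught up" % (cycle + 1))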

The failures I'm seeing generally look like

2021-05-01 15:33:10.968 EDT [8281] FATAL: inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19
2021-05-01 15:33:10.968 EDT [8281] CONTEXT: WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW

with a variety of WAL record types being named, so it doesn't seem
to be specific to any particular record type. I've twice gotten the
bogus-checksum-and-then-assertion-failure I reported before:

2021-05-01 17:07:52.992 EDT [17464] LOG: incorrect resource manager data checksum in record at 3/E0073EA4
TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464)

In both of those cases, the WAL on disk was perfectly fine, and the same
is true of most of the "inconsistent page" complaints. So the issue
definitely seems to be about the startup process mis-reading data that
was correctly shipped over.

Anyway, the new and interesting data concerns the relative failure rates
of different builds:

* Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles

* Reverting 1d257577e: 1 failure in 8 test cycles

* Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles

* Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes. However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure
rate by at least a factor of 5, and 1d257577e (Optionally prefetch
referenced data) seems to have increased it by another factor of 4.
But it looks like f003d9f87 (Add circular WAL decoding buffer)
didn't materially change the failure rate.
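
Spelling out the arithmetic behind those factors, using the counts above
(the crude one-sigma error bars are only there to show how rough estimates
from samples this small really are):

from math import sqrt

runs = {
    "HEAD":                                    (4, 8),
    "revert 1d257577e":                        (1, 8),
    "revert 1d257577e, f003d9f87":             (3, 28),
    "revert 1d257577e, f003d9f87, 323cbe7c7":  (2, 93),
}

for label, (failures, cycles) in runs.items():
    p = failures / cycles
    sigma = sqrt(p * (1 - p) / cycles)      # naive binomial standard error
    print("%-42s %2d/%-3d = %.3f +/- %.3f" % (label, failures, cycles, p, sigma))

print("323cbe7c7 factor: %.1fx   (3/28 vs 2/93)" % ((3/28) / (2/93)))
print("1d257577e factor: %.1fx   (4/8 vs 1/8)" % ((4/8) / (1/8)))
print("f003d9f87 factor: %.1fx   (1/8 vs 3/28)" % ((1/8) / (3/28)))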

Considering that 323cbe7c7 was supposed to be just refactoring,
and 1d257577e is allegedly disabled-by-default, these are surely
not the results I was expecting to get.

It seems like it's still an open question whether all this is
a real bug, or flaky hardware. I have seen occasional kernel
freezeups (or so I think -- machine stops responding to keyboard
or network input) over the past year or two, so I cannot in good
conscience rule out the flaky-hardware theory. But it doesn't
smell like that kind of problem to me. I think what we're looking
at is a timing-sensitive bug that was there before (maybe long
before?) and these commits happened to make it occur more often
on this particular hardware. This hardware is enough unlike
anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.

(I did try the time-honored ritual of reseating all the machine's
RAM, partway through this. Doesn't seem to have changed anything.)

Anyway, I'm not sure where to go from here. I'm for sure nowhere
near being able to identify the bug --- and if there really is
a bug that formerly had a one-in-fifty reproduction rate, I have
zero interest in trying to identify where it started by bisecting.
It'd take at least a day per bisection step, and even that might
not be accurate enough. (But, if anyone has ideas of specific
commits to test, I'd be willing to try a few.)
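
To put numbers on why bisection looks hopeless: assuming a ~1-in-50
per-cycle reproduction rate at the "good" end and the ~26-minute cycle time
above, the back-of-the-envelope math comes out even worse than a day per
step.  (The 95% confidence target and the guess of ~1000 candidate commits
are arbitrary on my part.)

from math import log, ceil

p = 1 / 50            # per-cycle failure probability at the "good" end
cycle_min = 24 + 2    # minutes per installcheck-parallel cycle plus catchup
confidence = 0.95     # chance of not mislabeling a bad commit as good

# Clean cycles needed before a commit can be called "good" with that confidence
n = ceil(log(1 - confidence) / log(1 - p))
print("%d clean cycles per step ~= %.0f hours (%.1f days)"
      % (n, n * cycle_min / 60, n * cycle_min / 1440))

# Steps needed to bisect a range of, say, 1000 candidate commits
steps = ceil(log(1000, 2))
print("x %d bisection steps ~= %.0f days total"
      % (steps, steps * n * cycle_min / 1440))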

regards, tom lane
