Re: WIP: WAL prefetch (another approach)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-04-22 01:34:11
Message-ID: 20210422013411.tbcaqqq6c23s2pxy@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-04-21 21:21:05 -0400, Tom Lane wrote:
> What I'm doing is running the core regression tests with a single
> standby (on the same machine) and wal_consistency_checking = all.

Do you run them over replication, or sequentially by storing the data in
an archive? Just curious, because it's so painful to run that scenario in
the replication case due to the tablespaces conflicting between
primary and standby, unless one disables the tablespace tests.

> The other PPC machine (with no known history of trouble) is the one
> that had the CRC failure I showed earlier. That one does seem to be
> actual bad data in the stored WAL, because the problem was also seen
> by pg_waldump, and trying to restart the standby got the same failure
> again.

It seems like that could also indicate an xlogreader bug that is
reliably hit? Once it gets confused about record lengths or such, I'd
expect CRC failures...
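
To spell out why a confused reader would surface as a CRC error rather
than something more explicit: the CRC covers xl_tot_len bytes, so if the
length is bogus the CRC is computed over the wrong byte range and the
check fails even when the stored WAL is intact. Rough sketch (not the
actual code, but modeled on ValidXLogRecord() in xlogreader.c):

#include "postgres.h"
#include "access/xlogrecord.h"
#include "port/pg_crc32c.h"

/*
 * Sketch of the record CRC check: the CRC covers the payload plus the
 * header up to (but not including) xl_crc.  A wrong xl_tot_len means
 * the wrong byte range gets summed, so the comparison fails even for
 * perfectly good on-disk WAL.
 */
static bool
record_crc_ok(XLogRecord *record)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord,
                record->xl_tot_len - SizeOfXLogRecord);
    COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
    FIN_CRC32C(crc);

    return EQ_CRC32C(record->xl_crc, crc);
}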

If it were actually wrong WAL contents I don't think any of the
xlogreader / prefetching changes could be responsible...

Have you tried reproducing it on commits before the recent xlogreader
changes?

commit 1d257577e08d3e598011d6850fd1025858de8c8c
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date: 2021-04-08 23:03:43 +1200

Optionally prefetch referenced data in recovery.

commit f003d9f8721b3249e4aec8a1946034579d40d42c
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date: 2021-04-08 23:03:34 +1200

Add circular WAL decoding buffer.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date: 2021-04-08 23:03:23 +1200

Remove read_page callback from XLogReader.

Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most
interesting bit.

> I've not been able to duplicate the consistency-check failures
> there. But because that machine is a laptop with a much inferior disk
> drive, the speeds are enough different that it's not real surprising
> if it doesn't hit the same problem.
>
> I've also tried to reproduce on 32-bit and 64-bit Intel, without
> success. So if this is real, maybe it's related to being big-endian
> hardware? But it's also quite sensitive to $dunno-what, maybe the
> history of WAL records that have already been replayed.

It might just be disk speed influencing how long the tests take, which
in turn increases the number of checkpoints during the test, increasing
the number of FPIs?
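
For context on why more checkpoints mean more FPIs: a page gets a
full-page image the first time it is modified after a checkpoint,
because the insert-time decision is essentially a comparison of the
page LSN against the latest checkpoint's redo pointer. Minimal sketch
(not the actual code; paraphrasing the logic in XLogRecordAssemble(),
with hypothetical parameter names):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Sketch: a registered buffer needs a full-page image if the page has
 * not been WAL-logged since the most recent checkpoint's redo pointer.
 * Slower test runs -> more checkpoints -> the redo pointer advances
 * more often -> more first-touch-after-checkpoint modifications -> more
 * FPIs in the generated WAL.
 */
static bool
needs_full_page_image(XLogRecPtr page_lsn, XLogRecPtr redo_rec_ptr,
                      bool do_page_writes)
{
    return do_page_writes && page_lsn <= redo_rec_ptr;
}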

Greetings,

Andres Freund
