Re: WIP: WAL prefetch (another approach)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-05-03 05:42:35
Message-ID: CA+hUKGLGKnzTDhW9rnzyYb7yvUiCoGK5pB74qQf3YYLZYSX4OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> That last point means that there was some hard-to-hit problem even
> before any of the recent WAL-related changes. However, 323cbe7c7
> (Remove read_page callback from XLogReader) increased the failure
> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
> referenced data) seems to have increased it by another factor of 4.
> But it looks like f003d9f87 (Add circular WAL decoding buffer)
> didn't materially change the failure rate.

Oh, wow. There are several surprising results there. Thanks for
running those tests for so long so that we could see the rarest
failures.

Even if there are somehow *two* causes of corruption, one preexisting
and one added by the refactoring or decoding patches, I'm struggling
to understand how the chance increases with 1d2575, since that only
adds code that isn't reached when not enabled (though I'm going to
re-review that).

> Considering that 323cbe7c7 was supposed to be just refactoring,
> and 1d257577e is allegedly disabled-by-default, these are surely
> not the results I was expecting to get.

+1

> It seems like it's still an open question whether all this is
> a real bug, or flaky hardware. I have seen occasional kernel
> freezeups (or so I think -- machine stops responding to keyboard
> or network input) over the past year or two, so I cannot in good
> conscience rule out the flaky-hardware theory. But it doesn't
> smell like that kind of problem to me. I think what we're looking
> at is a timing-sensitive bug that was there before (maybe long
> before?) and these commits happened to make it occur more often
> on this particular hardware. This hardware is enough unlike
> anything made in the past decade that it's not hard to credit
> that it'd show a timing problem that nobody else can reproduce.

Hmm, yeah that does seem plausible. It would be nice to see a report
from any other system though. I'm still trying, and reviewing...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2021-05-03 05:57:47 Re: Dump public schema ownership & seclabels
Previous Message Bharath Rupireddy 2021-05-03 05:41:12 Re: Identify missing publications from publisher while create/alter subscription.