Re: WIP: WAL prefetch (another approach)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-05-04 12:37:22
Message-ID: f2be6caa-5a7a-990b-c56e-a29454ae1cee@enterprisedb.com
Lists: pgsql-hackers

On 5/3/21 7:42 AM, Thomas Munro wrote:
> On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> That last point means that there was some hard-to-hit problem even
>> before any of the recent WAL-related changes. However, 323cbe7c7
>> (Remove read_page callback from XLogReader) increased the failure
>> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
>> referenced data) seems to have increased it by another factor of 4.
>> But it looks like f003d9f87 (Add circular WAL decoding buffer)
>> didn't materially change the failure rate.
>
> Oh, wow. There are several surprising results there. Thanks for
> running those tests for so long so that we could see the rarest
> failures.
>
> Even if there are somehow *two* causes of corruption, one preexisting
> and one added by the refactoring or decoding patches, I'm struggling
> to understand how the chance increases with 1d2575, since that only
> adds code that isn't reached when not enabled (though I'm going to
> re-review that).
>
>> Considering that 323cbe7c7 was supposed to be just refactoring,
>> and 1d257577e is allegedly disabled-by-default, these are surely
>> not the results I was expecting to get.
>
> +1
>
>> It seems like it's still an open question whether all this is
>> a real bug, or flaky hardware. I have seen occasional kernel
>> freezeups (or so I think -- machine stops responding to keyboard
>> or network input) over the past year or two, so I cannot in good
>> conscience rule out the flaky-hardware theory. But it doesn't
>> smell like that kind of problem to me. I think what we're looking
>> at is a timing-sensitive bug that was there before (maybe long
>> before?) and these commits happened to make it occur more often
>> on this particular hardware. This hardware is enough unlike
>> anything made in the past decade that it's not hard to credit
>> that it'd show a timing problem that nobody else can reproduce.
>
> Hmm, yeah that does seem plausible. It would be nice to see a report
> from any other system though. I'm still trying, and reviewing...
>

FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines - two x86_64 ones, and two rpi4. The x86 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.
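
For anyone who wants to repeat this, the loop can be sketched roughly as
below. This is only a minimal sketch, not the exact script used here: the
helper name run_rounds and the round count are made up for illustration,
and it assumes it is started from a PostgreSQL source tree with a server
already running for installcheck.

```python
import subprocess
import sys


def run_rounds(cmd, rounds, cwd=None):
    """Run cmd up to `rounds` times.

    Returns the 1-based number of the first round that exited with a
    non-zero status, or None if every round succeeded.
    """
    for i in range(1, rounds + 1):
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            # Stop at the first failure so the state can be inspected.
            return i
    return None


# Hypothetical usage (assumes a source tree and a running server):
#   failed = run_rounds(["make", "installcheck-parallel"], rounds=1000)
#   print("all rounds passed" if failed is None else f"round {failed} failed")
```

Stopping at the first failing round keeps the data directory and regression
output around for a post-mortem instead of overwriting them.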

Obviously, it's possible there's something that neither of those (very
different) kinds of systems triggers, but I'd say it might also be a hint
that this really is a hw issue on the old ppc macs. Or maybe something
very specific to that arch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
