Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Sergei Kornilov <sk(at)zsrv(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, pgbf(at)twiska(dot)com
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-09-25 00:18:56
Message-ID: CA+hUKG+cXwfAk0dEbD5CZ76p1uADGywuvb6_N4Uhziek54FZHg@mail.gmail.com
Lists: pgsql-bugs

On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote:
> > I see there was a failure on 16 on the very slow AIX box, and I have
> > access so looking into that...
>
> Lucky you, if I may say ;)

FTR, anyone involved with an open source project can get an account on
the GCC compile farm machines. That particular machine is so
overloaded that it's practically unusable (~8 hours to run the test,
hard to run vi, etc.).

> A bunch of architectures that are not Intel are failing. Here is a
> summary based on the buildfarm reports:
> topminnow, mips64el with gcc 4.9.2
> mereswine, ARMv7 with gcc 10.2.1
> sungazer, ppc64 with gcc 8.3.0
> frogfish, mips64el with gcc 4.6.3
> mamba, macppc with gcc 10.4.0
> gull, ARMv7 with clang 13.0.0
> grison, ARMv7 with gcc 4.6.3
> copperhead, riscv64 with gcc 10.X
>
> The only thing close to that I have close by is tanager on Armv7 (it
> has not reported to the buildfarm for a few weeks as it has
> overheated because of the summer here, but I've put it back online
> now). However, it has passed a few hundred cycles with both gcc and
> clang yesterday, on top of having a clean buildfarm run.

One thing that the failing systems have in common is that they are
extremely slow, taking 3 to 8 hours to complete the tests. turaco is
an armv7 system that doesn't fail, but it's much faster. At a guess,
it's probably something like an armv8 CPU that is just running 32-bit
armv7 software, not a real old-school armv7 chip.

Which gives me the idea to try these tests under qemu...

> With sungazer now failing on REL_16_STABLE, it feels to me that we are
> actually looking at two bugs? One on HEAD, and one in stable
> branches? For HEAD and the 2PC failure, the records up to PREPARE
> TRANSACTION should be replayed by the standby getting promoted, but
> I'd rather dig into that with a host that's able to report the
> failure.

Oh, right, yeah, that is quite different and could even be unrelated.
