Re: [BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-24 13:03:00
Message-ID: 20200424150300.1b3b0c20@firost
Lists: pgsql-bugs pgsql-hackers

On Fri, 24 Apr 2020 12:43:51 +0900
Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Thu, Apr 23, 2020 at 10:21:15PM -0400, Tom Lane wrote:
> > Looks like the news is not good :-(
>
> Yes, I was looking at that for the last couple of hours, and just
> pushed something to put back the buildfarm to a green state for now
> (based on the first results things seem stable now) by removing the
> defective subset of tests.
>
> > I see that my own florican is one of the failing critters, though
> > it failed only on HEAD which seems odd. Any suggestions what to
> > look for?
>
> The issue comes from the parts of the test where we expect some .ready
> files to exist (or not) after triggering a restartpoint to force some
> segments to be recycled. And looking more at it, I suspect that the
> issue is actually that we don't make sure in the test that the
> standbys started have replayed up to the segment switch record
> triggered on the primary (the one within generate_series(10,20)), and
> then the follow-up restart point does not actually recycle the
> segments we expect to recycle. That's more likely going to be a
> problem on slower machines as the window gets wider between the moment
> the standbys reach their consistency point and the moment the switch
> record is replayed.

Indeed.

Regarding your fix: as we don't know whether the standby has caught up with
the latest available record, there's really no point in keeping this test either:

# Recovery with archive_mode=on should not create .ready files.
# Note that this segment did not exist in the backup.
ok( !-f "$standby1_data/$segment_path_2_ready",
    ".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby" );

I agree the three tests could be removed, as they were not covering the bug we
were chasing. However, they might still be useful to detect future unexpected
behavior changes. If you agree, please find attached a patch proposal against
HEAD that recreates these three tests **after** a waiting loop on both standby1
and standby2. This waiting loop is inspired by the tests in 9.5 -> 10.
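
To illustrate the idea of such a waiting loop, here is a minimal, self-contained
sketch. It is not the code from the attached patch: the helper names (lsn_cmp,
wait_for_replay), the LSN values, and the simulated standby are all assumptions
made for the example. A real TAP test would ask each standby for its replay
position instead of reading it from a list.

```perl
use strict;
use warnings;

# Compare two LSNs written as "hi/lo" in hexadecimal (e.g. "0/3000060").
# Returns -1, 0 or 1, like the <=> operator.
sub lsn_cmp
{
	my ($left, $right) = @_;
	my ($l_hi, $l_lo) = map { hex($_) } split m{/}, $left;
	my ($r_hi, $r_lo) = map { hex($_) } split m{/}, $right;
	return ($l_hi <=> $r_hi) || ($l_lo <=> $r_lo);
}

# Hypothetical helper: call $get_replay_lsn repeatedly until it reports a
# position at or past $target_lsn. Returns 1 on success, 0 if the standby
# has still not caught up after $max_attempts polls.
sub wait_for_replay
{
	my ($get_replay_lsn, $target_lsn, $max_attempts, $delay) = @_;
	for my $attempt (1 .. $max_attempts)
	{
		my $replayed = $get_replay_lsn->();
		return 1 if lsn_cmp($replayed, $target_lsn) >= 0;
		select(undef, undef, undef, $delay);    # sleep for $delay seconds
	}
	return 0;
}

# Simulated standby (assumption): each poll reports a later replay position.
my @positions = ('0/2FFFFD8', '0/3000028', '0/3000060');
my $polls     = 0;
my $caught_up = wait_for_replay(
	sub { $positions[ $polls++ ] // $positions[-1] },
	'0/3000060', 10, 0.01);
```

Only once such a loop reports success for both standbys would the checks on the
.ready files be meaningful, whatever the speed of the machine.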

Regards,

Attachment Content-Type Size
wait-for-wal-replay.patch text/x-patch 3.8 KB
