Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Date: 2022-04-08 21:55:51
Message-ID: 4015413.1649454951@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.
>> However, assuming it is real, I guess I wonder why we don't say
>> CHECKPOINT_FORCE in standby mode too.

> I guess it might partially be that restartpoints require a checkpoint to have
> happened on the primary. If we used FORCE, we'd have to wait till the next
> checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
> CHECKPOINT; before shutting the standby down.

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time. skink's history allows
us to bound the failure introduction between 79b716cfb7 and
d7ab2a9a3c, which I think makes it just about certain that it was
5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
not bisected to be 100% sure.

Adding some debug printouts to ExecuteRecoveryCommand convinces me
that indeed the archive_cleanup_command is NOT getting called by the
problematic CHECKPOINT command. I surmise based on Andres' comment
above that the standby isn't making a restartpoint for lack of
an available primary checkpoint, which looks to me like it could be
a pre-existing bug in the test case: it's sure not doing anything to
guarantee that the primary's checkpoint record has reached the standby.

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(. However, I also
wonder why it is that this test case wasn't occasionally failing already.

regards, tom lane

Attachment Content-Type Size
002_archiving-hack.patch text/x-diff 890 bytes

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2022-04-08 22:14:29 pgsql: Silence compiler warnings for unsupported compression methods.
Previous Message Tom Lane 2022-04-08 18:55:27 pgsql: Improve frontend error logging style.

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-04-08 21:56:37 Re: Lowering the ever-growing heap->pd_lower
Previous Message Peter Geoghegan 2022-04-08 21:43:31 Re: Lowering the ever-growing heap->pd_lower