Re: [PATCH] Fix fragile walreceiver test.

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, dbryan(dot)green(at)gmail(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Fix fragile walreceiver test.
Date: 2025-11-05 07:30:30
Message-ID: CABPTF7WOqRKu6Agb6rFgZZ=X2mtaVbM7G=GMYVBx=F2L2TPQsw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> > Problem: restart() kills the walreceiver (as it should), which writes
> > that exact FATAL message to the log. The test then searches the log and
> > finds it.
>
> Timing issue then, the buildfarm has not been complaining on this one
> AFAIK, there have been no recoveryCheck failures reported:
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> > The test has a comment claiming "a new log file is used on node
> > restart". TAP tests use pg_ctl with a fixed filename that gets reused
> > across restarts. No log rotation.
>
> I've fat-fingered this assumption, indeed, missing that one would need
> to do an extra rotate_logfile() before the restart.
>
> > The fix is obvious: check that the walreceiver PID stays constant.
> > That's what we actually care about anyway.

Thanks for catching and analyzing this!

> Hmm. The reason why I didn't use a PID matching check (mentioned at
> [1]) is that this is not entirely bullet-proof. On a very slow
> machine, one could assume that standby_1 generates some records and
> that these are replayed by standby_2 *before* the PID of the WAL
> receiver is retrieved. This could lead to false positives in some
> cases, and a bunch of buildfarm members are very slow. You have a
> point that these would unlikely happen in normal runs, so a PID
> matching check would be relevant most of the time anyway, even if the
> original PID has been fetched after the TLI jump has been processed in
> standby_2. I'd rather keep the log check, TBH, bypassing it with an
> extra rotate_logfile() before the restart of standby_2.

I’ve also prepared a patch for this method.

> > This matters because changes to I/O behavior elsewhere in the code can
> > make this test fail spuriously. I hit it while working on O_CLOEXEC
> > handling for Windows.
>
> Fun. And the WAL receiver never stops after the restart of standby_2
> with the log entry present in the server logs generated before the
> restart, right?
>

Best,
Xuneng

Attachment Content-Type Size
v2-0001-Fix-timing-issue-in-recovery-004_timeline_switch-.patch application/octet-stream 2.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nishant Sharma 2025-11-05 07:32:07 Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement
Previous Message Antonin Houska 2025-11-05 07:12:19 Re: Adding REPACK [concurrently]