| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz>, dbryan(dot)green(at)gmail(dot)com |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [PATCH] Fix fragile walreceiver test. |
| Date: | 2025-11-05 07:30:30 |
| Message-ID: | CABPTF7WOqRKu6Agb6rFgZZ=X2mtaVbM7G=GMYVBx=F2L2TPQsw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> > Problem: restart() kills the walreceiver (as it should), which writes
> > that exact FATAL message to the log. The test then searches the log and
> > finds it.
>
> Timing issue then, the buildfarm has not been complaining on this one
> AFAIK, there have been no recoveryCheck failures reported:
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> > The test has a comment claiming "a new log file is used on node
> > restart". TAP tests use pg_ctl with a fixed filename that gets reused
> > across restarts. No log rotation.
>
> I've fat-fingered this assumption, indeed, missing that one would need
> to do an extra rotate_logfile() before the restart.
>
> > The fix is obvious: check that the walreceiver PID stays constant.
> > That's what we actually care about anyway.
Thanks for catching and analyzing this!
> Hmm. The reason why I didn't use a PID matching check (mentioned at
> [1]) is that this is not entirely bullet-proof. On a very slow
> machine, one could assume that standby_1 generates some records and
> that these are replayed by standby_2 *before* the PID of the WAL
> receiver is retrieved. This could lead to false positives in some
> cases, and a bunch of buildfarm members are very slow. You have a
> point that these would unlikely happen in normal runs, so a PID
> matching check would be relevant most of the time anyway, even if the
> original PID has been fetched after the TLI jump has been processed in
> standby_2. I'd rather keep the log check, TBH, bypassing it with an
> extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
> > This matters because changes to I/O behavior elsewhere in the code can
> > make this test fail spuriously. I hit it while working on O_CLOEXEC
> > handling for Windows.
>
> Fun. And the WAL receiver never stops after the restart of standby_2
> with the log entry present in the server logs generated before the
> restart, right?
>
Best,
Xuneng
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Fix-timing-issue-in-recovery-004_timeline_switch-.patch | application/octet-stream | 2.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nishant Sharma | 2025-11-05 07:32:07 | Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement |
| Previous Message | Antonin Houska | 2025-11-05 07:12:19 | Re: Adding REPACK [concurrently] |