| From: | Bryan Green <dbryan(dot)green(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | [PATCH] Fix fragile walreceiver test. |
| Date: | 2025-11-05 06:03:29 |
| Message-ID: | 9d00b597-d64a-4f1e-802e-90f9dc394c70@gmail.com |
| Lists: | pgsql-hackers |
The recovery/004_timeline_switch test has been failing for me on
Windows. The failure is a bug in the test itself, not in the walreceiver.
The test does this:
    $node_standby_2->restart;
    # ... timeline switch happens ...
    ok( !$node_standby_2->log_contains(
            "FATAL: .* terminating walreceiver process due to administrator command"
        ),
        'WAL receiver should not be stopped across timeline jumps');
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the whole
log and finds it.
The test has a comment claiming "a new log file is used on node
restart". That is not the case: TAP tests use pg_ctl with a fixed log
filename that gets reused across restarts. There is no log rotation.
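To illustrate the mechanism, here is a minimal, self-contained Perl sketch. It does not use the TAP framework: a temporary file stands in for the node's log, and the helper name append_log is hypothetical. The log lines are shortened from the ones quoted in this message, written with the two spaces PostgreSQL puts after the severity (the archive rendering collapses them). It only shows that a whole-file search still sees a message written before the restart completed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the node's single log file (assumption: a real TAP node
# manages its own log path; this temp file is only for illustration).
my ($fh, $logfile) = tempfile(UNLINK => 1);
close $fh;

# Append one line to the log, as a server writing to a reused file would.
sub append_log
{
    my ($line) = @_;
    open my $out, '>>', $logfile or die "cannot open $logfile: $!";
    print $out "$line\n";
    close $out;
}

# During the restart: the old walreceiver is shut down.
append_log('walreceiver[52440] FATAL:  terminating walreceiver process due to administrator command');

# After the restart: the new walreceiver follows the timeline switch.
append_log('walreceiver[83824] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1');
append_log('walreceiver[83824] LOG:  restarted WAL streaming at 0/03000000 on timeline 2');

# Read the whole file back and apply the pattern used by the test.
open my $in, '<', $logfile or die "cannot open $logfile: $!";
my $contents = do { local $/; <$in> };
close $in;

my $pattern = 'FATAL: .* terminating walreceiver process due to administrator command';
my $matched = ($contents =~ m/$pattern/) ? 1 : 0;
print "pattern found in whole log: $matched\n";
```

Because nothing truncates or rotates the file between the two phases, the pattern matches even though the walreceiver that handled the timeline switch never exited.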
I added logging to confirm what's actually happening. The walreceiver
works correctly - same PID handles both timelines:
    2025-11-04 23:05:28.539 CST walreceiver[83824] LOG: started streaming WAL from primary at 0/03000000 on timeline 1
    2025-11-04 23:05:28.543 CST startup[42764] LOG: new target timeline is 2
    2025-11-04 23:05:28.544 CST walreceiver[83824] LOG: restarted WAL streaming at 0/03000000 on timeline 2
That's PID 83824 throughout. Works fine.
Earlier in the same log, from the restart:
    2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL: terminating walreceiver process due to administrator command
Different PID (52440), expected shutdown. This is what the test finds.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
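As a rough illustration of the PID-based check, the sketch below applies it to the log lines quoted in this message. It is not the attached patch, whose contents are not reproduced here. It is a standalone Perl script, and the names %pid_for_timeline and $same_pid are made up for the example. The lines use the two spaces PostgreSQL writes after the severity.

```perl
use strict;
use warnings;

# Log lines as quoted in this message (the restart's FATAL line first).
my @log = (
    '2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL:  terminating walreceiver process due to administrator command',
    '2025-11-04 23:05:28.539 CST walreceiver[83824] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1',
    '2025-11-04 23:05:28.543 CST startup[42764] LOG:  new target timeline is 2',
    '2025-11-04 23:05:28.544 CST walreceiver[83824] LOG:  restarted WAL streaming at 0/03000000 on timeline 2',
);

# Map each timeline to the PID of the walreceiver that streamed it.
my %pid_for_timeline;
foreach my $line (@log)
{
    if ($line =~ /walreceiver\[(\d+)\] LOG:\s+(?:started streaming WAL|restarted WAL streaming)\b.* on timeline (\d+)/)
    {
        $pid_for_timeline{$2} = $1;
    }
}

# The walreceiver survived the switch if every timeline maps to one PID.
my %distinct = map { $_ => 1 } values %pid_for_timeline;
my $same_pid = (keys(%distinct) == 1) ? 1 : 0;

print "timelines seen: ", join(',', sort keys %pid_for_timeline),
  "; same walreceiver PID: $same_pid\n";
```

With these lines, timelines 1 and 2 both map to PID 83824, and the FATAL line from PID 52440 does not affect the result because it is not a streaming message. A real test could get the PID some other way than parsing the log; this sketch only shows the comparison itself.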
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.
Patch attached.
--
Bryan Green
EDB: https://www.enterprisedb.com
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-Fix-timing-dependent-failure-in-recovery-004_timelin.patch | text/plain | 3.0 KB |