From: | "Maksim(dot)Melnikov" <m(dot)melnikov(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Incorrect checksum in control file with pg_rewind test |
Date: | 2025-09-04 15:18:30 |
Message-ID: | f59335a4-83ff-438a-a30e-7cf2200276b6@postgrespro.ru |
Lists: | pgsql-hackers |
Hi, hackers!
I've got a test failure in the pg_rewind tests, and it seems we have a
read/write race on the pg_control file. The test error is "incorrect
checksum in control file". The build was compiled with the
-DEXEC_BACKEND flag.
# +++ tap check in src/bin/pg_rewind +++
Bailout called. Further testing stopped: pg_ctl start failed
t/001_basic.pl ...............
Dubious, test returned 255 (wstat 65280, 0xff00)
All 20 subtests passed
2025-05-07 15:00:39.353 MSK [2002308] LOG: starting backup recovery with redo LSN 0/2000028, checkpoint LSN 0/2000070, on timeline ID 1
2025-05-07 15:00:39.354 MSK [2002307] FATAL: incorrect checksum in control file
2025-05-07 15:00:39.354 MSK [2002308] LOG: redo starts at 0/2000028
2025-05-07 15:00:39.354 MSK [2002308] LOG: completed backup recovery with redo LSN 0/2000028 and end LSN 0/2000138
2025-05-07 15:00:39.354 MSK [2002301] LOG: background writer process (PID 2002307) exited with exit code 1
2025-05-07 15:00:39.354 MSK [2002301] LOG: terminating any other active server processes
2025-05-07 15:00:39.355 MSK [2002301] LOG: shutting down because restart_after_crash is off
2025-05-07 15:00:39.356 MSK [2002301] LOG: database system is shut down
# No postmaster PID for node "primary_remote"
[15:00:39.438](0.238s) Bail out! pg_ctl start failed
The failure occurred while restarting the primary node to check that
the rewind went correctly.
The error is very rare and difficult to reproduce.
It seems we have a race between the process that replays WAL on startup
and updates the control file, and the other sub-processes that were
started with exec and read the control file.
As a result, a sub-process can read a partially updated file with an
incorrect CRC.
The reason is that LocalProcessControlFile doesn't acquire
ControlFileLock, and it can't do so.
I found the thread
https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
where a similar issue was discussed for frontend programs. The decision
there was to retry the control file read in case of CRC failure. Details
can be found in commit 5725e4ebe7a936f724f21e7ee1e84e54a70bfd83. My
suggestion is to use the same approach here. A patch is attached.
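To illustrate the retry idea in isolation, here is a minimal standalone
sketch. It is not the attached patch and does not use PostgreSQL's code:
ToyControlFile, toy_checksum, read_control_with_retry and the simulated
reader are all hypothetical names made up for this example, and the toy
checksum (FNV-1a) only stands in for the real CRC-32C. The sketch simply
re-reads while the checksum does not match, up to a bounded number of
attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for pg_control: a payload plus its stored checksum. */
typedef struct
{
	uint8_t		payload[16];
	uint32_t	checksum;
} ToyControlFile;

/* Toy checksum (FNV-1a); an assumption used in place of CRC-32C. */
uint32_t
toy_checksum(const uint8_t *data, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* One read of the file into *out; the result may be torn. */
typedef void (*read_fn) (ToyControlFile *out, void *arg);

/*
 * Re-read until the checksum matches or max_attempts reads were done.
 * Returns true on a valid copy; *attempts is the number of reads performed.
 */
bool
read_control_with_retry(read_fn read_once, void *arg, int max_attempts,
						ToyControlFile *out, int *attempts)
{
	assert(read_once != NULL && out != NULL && attempts != NULL);

	*attempts = 0;
	for (int i = 1; i <= max_attempts; i++)
	{
		read_once(out, arg);
		*attempts = i;
		if (toy_checksum(out->payload, sizeof(out->payload)) == out->checksum)
			return true;
		/* A real implementation would sleep briefly before retrying. */
	}
	return false;
}

/* Simulated reader: the first torn_reads calls return a torn copy. */
typedef struct
{
	int			torn_reads;
	int			calls;
} TornReader;

void
simulated_read(ToyControlFile *out, void *arg)
{
	TornReader *r = arg;

	memset(out->payload, 0xAB, sizeof(out->payload));
	out->checksum = toy_checksum(out->payload, sizeof(out->payload));
	if (r->calls++ < r->torn_reads)
		out->payload[0] ^= 0xFF;	/* payload changed, checksum now stale */
}
```

With a TornReader whose torn_reads is 2, the third read is the first
consistent one, so read_control_with_retry succeeds on the third attempt.
If every read is torn, it gives up after max_attempts and the caller can
still report the checksum error as before.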
Best regards,
Maksim Melnikov
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Try-to-handle-torn-reads-of-pg_control-in-sub-pos.patch | text/x-patch | 2.3 KB |