Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, "Anton A(dot) Melnikov" <aamelnikov(at)inbox(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date: 2023-07-25 01:36:03
Message-ID: CA+hUKG+a+M6tbKJ5Ei2SFBDJxw4UjGLyRBDVrUfuSBZZ0ht0LQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 25, 2023 at 8:18 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> (Yeah, I know we have code to verify checksums during a base
> backup, but as discussed elsewhere, it doesn't work.)

BTW the code you are referring to there seems to think 4KB
page-halves are atomic; not sure if that's imagining page-level
locking in ancient Linux (?), or imagining default setvbuf() buffer
size observed with some specific implementation of fread(), or
confusing power-failure-sector-based atomicity with concurrent access
atomicity, or something else, but for the record what we actually see
in this scenario on ext4 is the old/new page contents mashed together
on much smaller boundaries (maybe cache lines), caused by duelling
concurrent memcpy() to/from, independent of any buffer/page-level
implementation details we might have been thinking of with that code.
Makes me wonder if it's even technically sound to examine the LSN.
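
To make the granularity point concrete, here is a minimal simulation sketch, not a reproduction of the ext4 behaviour itself. The 64-byte chunk size stands in for "maybe cache lines" and is an assumption, as are the helper names make_torn_page() and switch_offsets(), which are hypothetical and not PostgreSQL code. It builds a page out of interleaved old/new chunks and then reports where the contents switch from one version to the other:

```python
PAGE_SIZE = 8192   # PostgreSQL's default block size
CACHE_LINE = 64    # assumed tear granularity, for illustration only


def make_torn_page(old: bytes, new: bytes, chunk: int, take_new) -> bytes:
    """Assemble a page from chunk-sized pieces of `old` and `new`.

    take_new(k) decides whether chunk number k comes from `new`.
    This only imitates the result of a read racing with a write.
    """
    assert len(old) == len(new)
    out = bytearray()
    for start in range(0, len(old), chunk):
        src = new if take_new(start // chunk) else old
        out += src[start:start + chunk]
    return bytes(out)


def switch_offsets(old: bytes, new: bytes, observed: bytes) -> list:
    """Return byte offsets where `observed` flips between old and new data.

    Bytes that are identical in both versions are skipped because they
    cannot tell us which version they came from.
    """
    offsets = []
    prev = None
    for i, b in enumerate(observed):
        if old[i] == new[i]:
            continue
        cur = "new" if b == new[i] else "old"
        if prev is not None and cur != prev:
            offsets.append(i)
        prev = cur
    return offsets


old = b"\x00" * PAGE_SIZE
new = b"\xff" * PAGE_SIZE
# Every third chunk comes from the new version, the rest from the old one.
torn = make_torn_page(old, new, CACHE_LINE, lambda k: k % 3 == 0)
offsets = switch_offsets(old, new, torn)
print(offsets[:5])
```

In this model neither 4KB half of the torn page matches the old or the new image, so a check that assumes each half is internally consistent would be reasoning about a state that cannot be relied on.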

> It's also why we
> have to force full-page write on during a backup. But the whole thing
> is nasty because you can't really verify anything about the backup you
> just took. It may be full of gibberish blocks but don't worry because,
> if all goes well, recovery will fix it. But you won't really know
> whether recovery actually does fix it. You just kind of have to cross
> your fingers and hope.

Well, not without also scanning the WAL for FPIs, anyway... And
conceptually, that's why I think we probably want an 'FPI' of the
control file somewhere.
