Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, "Anton A(dot) Melnikov" <aamelnikov(at)inbox(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date: 2023-07-24 18:04:37
Message-ID: ZL69NXjCNG+WHCqG@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

(Adding David Steele into the CC on this one...)

* Thomas Munro (thomas(dot)munro(at)gmail(dot)com) wrote:
> This is a frustrating thread, because despite the last patch solving
> most of the problems we discussed, it doesn't address the
> low-level-backup procedure in a nice way. We'd have to tell users
> they have to flock that file, or add a new step "pg_controldata --raw
> > pg_control", which seems weird when they already have a connection
> to the server.

flock'ing the file strikes me as dangerous to ask external tools to do
due to the chances of breaking the running system if they don't do it
right. I'm generally open to the idea of having the backup tool have to
do more complicated work to be correct but that just seems likely to
cause issues. Also- haven't looked yet, but I'm not sure that's even
possible if your backup tool is running as a user who only has read
access to the data directory? I don't want us to give up on that
feature.

> Maybe it just doesn't matter if eg the pg_controldata program can
> spuriously fail if pointed at a running system, and I was being too
> dogmatic trying to fix even that. Maybe we should just focus on
> fixing backups. Even there, I am beginning to suspect we are solving
> this problem in the wrong place when a higher level change could
> simplify the problem away.

For a running system.. perhaps pg_controldata should be connecting to
the database and calling functions there? Or just complain that the
system is online and tell the user to do that?

> Idea for future research: Perhaps pg_backup_stop()'s label-file
> output should include the control file image (suitably encoded)? Then
> the recovery-from-label code could completely ignore the existing
> control file, and overwrite it using that copy. It's already
> partially ignoring it, by using the label file's checkpoint LSN
> instead of the control file's. Perhaps the captured copy could
> include the correct LSN already, simplifying that code, and the low
> level backup procedure would not need any additional steps or caveats.
> No more atomicity problem for low-level-backups... but probably not
> something we would back-patch, for such a rare failure mode.

I like this general direction and wonder if we could possibly even push
a bit harder on it: have the backup_label include the control file's
contents in some form that is understood and then tell tools to *not*
copy the running pg_control file ... and maybe even complain if a
pg_control file exists when we detect that backup_label has the control
file's image. We've certainly had problems in the past where people
would just nuke the backup_label file, even though they were restoring
from a backup, because they couldn't start the system up since their
restore command wasn't set up properly or their WAL archive was missing.

Being able to get rid of the control file being in the backup at all
would make it harder for someone to get to a corrupted-but-running
system and that seems like it's a good thing.

> Here's a new minimal patch that solves only the bugs in basebackup +
> the simple SQL-facing functions that read the control file, by simply
> acquiring ControlFileLock in the obvious places. This should be
> simple enough for back-patching?

I don't particularly like fixing this in a way that only works for
pg_basebackup and means that the users of other backup tools don't have
a solution to this problem. What are we supposed to tell users of
pgBackRest when they see this fix for pg_basebackup in the next set of
minor releases and they ask us if we've addressed this risk?

We might be able to accept the 're-read on CRC failure' approach, if it
were being used for pg_controldata and we documented that external
tools and especially backup tools using the low-level API are required
to check the CRC and to re-read on failure if accessing a running
system.

While it's not ideal, maybe we could get away with changing the contents
of the backup_label as part of a back-patch? The documentation, at
least on a quick look, says to copy the second field from pg_backup_stop
into a backup_label file but doesn't say anything about what those
contents are or if they can change. That would at least address the
concern of backups ending up with a corrupt pg_control file and not
being able to be restored, even if tools aren't updated to verify the
CRC or similar. Of course, that's a fair bit of code churn for a
backpatch, which I certainly understand as being risky. If we can't
back-patch that, it might still be the way to go moving forward, while
also telling tools to check the CRC. (I'm not going to try to figure
out some back-patchable pretend solution for this for shell scripts that
pretend to be able to backup running PG databases; this issue is
probably the least of their issues anyway...)

A couple of interesting notes on this though- pgBackRest doesn't only
read the pg_control file at backup time, we also check it at
archive_command time, to make sure that the system ID and version that
are in the control file match up with the information in the WAL file
we're getting ready to archive and that those match with the system ID
and version of the repo/stanza into which we are pushing the WAL file.
We do read the control file on the replica but that isn't the one we
actually push into the repo as part of a backup- that's always the one
we read from the primary (we don't currently support 'backup just from
the replica').

Coming out of our discussion regarding this, we're likely to move
forward on the check-CRC-and-re-read approach for the next pgBackRest
release. If PG provides a better solution for us to use, great, but
given that this has been shown to happen, we're not intending to wait
around for PG to provide us with a better fix.

Thanks,

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabrízio de Royes Mello 2023-07-24 18:07:07 Re: Removing the fixed-size buffer restriction in hba.c
Previous Message Tom Lane 2023-07-24 17:58:45 Re: Use of additional index columns in rows filtering