Re: recovery_init_sync_method=wal

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: recovery_init_sync_method=wal
Date: 2021-03-21 15:31:43
Message-ID: 20210321153143.GF20766@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Thomas Munro (thomas(dot)munro(at)gmail(dot)com) wrote:
> 2. You made a file system-level copy of a cluster that you shut down
> cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start
> up the copy. Its checkpoint is a forgery. (Maybe our manual should
> mention this problem under "25.2. File System Level Backup" where it
> teaches you to rsync your cluster.)

Yes, it'd be good to get some updates to the backup documentation around
this which stresses in all cases that your backup utility should make
sure to fsync everything it restores.

> These are both scatter gun approaches that can sometimes do a lot of
> useless work, and I'd like to find a precise version that uses
> information we already have about what might be dirty according to the
> meaning of a checkpoint and a transaction log. The attached patch
> does that as follows:
>
> 1. Sync the WAL using fsync(), to enforce the log-before-data rule.
> That's moved into the existing loop that scans the WAL files looking
> for temporary files to unlink. (I suppose it should perhaps do the
> "presync" trick too. Not done yet.)
>
> 2. While replaying the WAL, if we ever decide to skip a page because
> of its LSN, remember to fsync() the file in the next checkpoint anyway
> (because the data might be dirty in the kernel). This way we sync
> all files that changed since the last checkpoint (even if we don't
> have to apply the change again). (A more paranoid mode would mark the
> page dirty instead, so that we'll not only fsync() it, but we'll also
> write it out again. This would defend against kernels that have
> writeback failure modes that include keeping changes but dropping
> their own dirty flag. Not done here.)

Presuming that we do add to the documentation the language to document
what's assumed (and already done by modern backup tools) that they're
fsync'ing everything they're restoring, do we/can we have an option
which those tools could set that explicitly tells PG "everything in the
cluster has been fsync'd already, you don't need to do anything extra"?
Perhaps also/seperately one for WAL that's restored with restore command
if we think that's necessary?

Otherwise, just in general, agree with doing this to address the risks
discussed around regular crash recovery. We have some pretty clear "if
the DB was doing recovery and was interrupted, you need to restore from
backup" messages today in xlog.c, and this patch didn't seem to change
that? Am I missing something or isn't the idea here that these changes
would make it so you aren't going to end up with corruption in those
cases? Specifically looking at-

xlog.c:6509-
case DB_IN_CRASH_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at %s",
str_time(ControlFile->time)),
errhint("This probably means that some data is corrupted and"
" you will have to use the last backup for recovery.")));
break;

case DB_IN_ARCHIVE_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at log time %s",
str_time(ControlFile->checkPointCopy.time)),
errhint("If this has occurred more than once some data might be corrupted"
" and you might need to choose an earlier recovery target.")));
break;

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2021-03-21 15:33:41 Re: pl/pgsql feature request: shorthand for argument and local variable references
Previous Message Tom Lane 2021-03-21 14:53:17 Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace