Re: [BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-14 16:03:19
Message-ID: 20200414180319.731f9593@firost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, 13 Apr 2020 16:14:14 +0900
Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
> - XLogCtl->SharedRecoveryInProgress = true;
> XLogCtl->SharedHotStandbyActive = false;
> XLogCtl->SharedPromoteIsTriggered = false;
> XLogCtl->WalWriterSleeping = false;
> [...]
> + switch (ControlFile->state)
> + {
> + XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
> + break;
> + default:
> + XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
> + }
> It seems to me that the initial value of SharedRecoveryState should be
> CRASH_RECOVERING all the time no? StartupXLOG() is a code path taken
> even if starting cleanly, and the flag would be reset correctly at the

Previous version of the patch was setting CRASH_RECOVERING. Fujii-san reported
(the 4 Apr 2020 02:49:50 +0900) that it could be useful to expose a better value
until relevant code is reached in StartupXLOG() so GetRecoveryState() returns
a safer value for futur use.

As I answered upthread, I'm not sure who would need this information before the
WAL machinery is up though. Note that ARCHIVE_RECOVERING and CRASH_RECOVERING
are compatible with the previous behavior.

Maybe the solution would be to init with CRASH_RECOVERING and add some comment
in GetRecoveryState() to warn the state is "enforced" after the XLOG machinery
is started and is init'ed to RECOVERING in the meantime?

I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
to GetRecoveryState().

> +typedef enum RecoveryState
> +{
> + NOT_IN_RECOVERY = 0, /* currently in production */
> + CRASH_RECOVERING, /* recovering from a crash */
> + ARCHIVE_RECOVERING /* recovering archives as requested */
> +} RecoveryState;
> I also have some issues with the name of those variables, here is an
> idea for the three states:
> I would recommend to use the same suffix for those variables to ease
> grepping.

Sounds really good to me. Thanks!

> /*
> - * Local copy of SharedRecoveryInProgress variable. True actually means "not
> - * known, need to check the shared state".
> + * This is false when SharedRecoveryState is not ARCHIVE_RECOVERING.
> + * True actually means "not known, need to check the shared state".
> */
> A double negation sounds wrong to me. And actually this variable is
> false when the shared state is set to NOT_IN_RECOVERY, and true when
> the state is either CRASH_RECOVERING or ARCHIVE_RECOVERING because it
> means that recovery is running, be it archive recovery or crash
> recovery, so the comment is wrong.

Indeed, sorry. Fixed.

> + /* The file is always deletable if archive_mode is "off". */
> + if ( ! XLogArchivingActive() )
> + return true;
> [...]
> + if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
> return true;
> Incorrect indentation.

Is it the spaces as reported by Horiguchi-san? I removed them in latest patch.

> UpdateControlFile();
> +
> + SpinLockAcquire(&XLogCtl->info_lck);
> + XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
> + SpinLockRelease(&XLogCtl->info_lck);
> +
> LWLockRelease(ControlFileLock);
> Here, the shared flag is updated while holding ControlFileLock to
> ensure a consistent state of things within shared memory, so it would
> be nice to add a comment about that.

Indeed. the original code had no such comment and I asked myself the same
question. Added.

> +RecoveryState
> +GetRecoveryState(void)
> +{
> + volatile XLogCtlData *xlogctl = XLogCtl;
> +
> + return xlogctl->SharedRecoveryState;
> +}
> Er, you need to acquire info_lck to look at the state here, no?

Yes, fixed.

> /*
> - * The file is always deletable if archive_mode is "off". On standbys
> - * archiving is disabled if archive_mode is "on", and enabled with
> - * "always". On a primary, archiving is enabled if archive_mode is "on"
> - * or "always".
> + * On standbys, the file is deletable if archive_mode is not
> + * "always".
> */
> It would be good to mention that while in crash recovery, files are
> not considered as deletable.

Well, in fact, I am still wondering about this. I was hesitant to add a
shortcut like:

/* no WAL segment cleanup during crash recovery */
if (recoveryState == RECOVERT_STATE_CRASH)
return false;

But, what if for example we crashed for lack of disk space during intensive
wirtes? During crash recovery, any WAL marked as .done could be removed and
allow the system to start again and maybe make even further WAL cleanup by
archiving some more WAL without competing with previous high write ratio.

When we recover a primary, this behavior seems conform with any value of
archive_mode, even if it has been changed after crash and before starting it.
On a standby, we might create .ready files, but they will be removed during the
first restartpoint if needed.

> I agree that having a separate .pl file for the tests of this thread
> is just cleaner. Here are more comments about these.
> +# temporary fail archive_command for futur tests
> +$node->safe_psql('postgres', q{
> + ALTER SYSTEM SET archive_command TO 'false';
> + SELECT pg_reload_conf();
> +});
> That's likely portable on Windows even if you skip the tests there,
> and I am not sure that it is a good idea to rely on it being in PATH.
> Depending on the system, the path of the command is also likely going
> to be different. As here the goal is to prevent the archiver to do
> its work, why not relying on the configuration where archive_mode is
> set but archive_command is not? This would cause the archiver to be a
> no-op process, and .ready files will remain around. You could then
> replace the lookup of pg_stat_archiver with poll_query_until() and a
> query that makes use of pg_stat_file() to make sure that the .ready
> exists when needed.

I needed a failure so I can test pg_stat_archiver reports it as well. In my
mind, using "false" would either trigger a failure because false returns 1 or...
a failure because the command is not found. In either way, the result is the

Using poll_query_until+pg_stat_file, is a good idea, but not enough as
archiver reports a failure some moment after the .ready signal file appears.

> + ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
> + "WAL still ready to archive in archive_status");
> It would be good to mention in the description the check applies to a
> primary.


> +# test recovery without archive_mode=always does not keep .ready WALs
> +$standby1 = get_new_node('standby');
> +$standby1->init_from_backup($node, 'backup', has_restoring => 1);
> +$standby1_data = $standby1->data_dir;
> +$standby1->start;
> +$standby1->safe_psql('postgres', q{CHECKPOINT});
> For readability archive_mode = on should be added to the configuration
> file? Okay, this is inherited from the primary, still that would
> avoid any issues with this code is refactored in some way.


> "WAL waiting to be archived in backup removed with archive_mode=on
> on
> standby" ); That should be "WAL segment" or "WAL file", but not WAL.

updated everywhere.

> Regarding the tests on a standby, it seems to me that the following
> is necessary:
> 1) Test that with archive_mode = on, segments are not marked with
> .ready.
> 2) Test that with archive_mode = always, segments are marked with
> .ready during archive recovery.
> 3) Test that with archive_mode = always, segments are not removed
> during crash recovery.
> I can see tests for 1) and 2),

Not really. The current tests does not check that segments created *after* the
backup are marked or not with .ready file on standby. I added these tests plus
some various other ones.

> but not 3). Could you add a
> stop('immediate')+start() for $standby2 at the end of
> and check that the .ready file is still there
> after crash recovery?


> The end of the tests actually relies on the
> fact that archive_command is set to "false" when the cold backup is
> taken, before resetting it. I think that it would be cleaner to
> enforce the configuration you want to test before starting each
> standby. It becomes easier to understand the flow of the test for the
> reader.

done as well.

Thank you for your review!

Attachment Content-Type Size
0001-v5-Fix-WAL-retention-during-production-crash-recovery.patch text/x-patch 7.0 KB
0002-v4-Add-tests-relative-to-WAL-archiving.patch text/x-patch 7.4 KB

In response to


Browse pgsql-bugs by date

  From Date Subject
Next Message Jehan-Guillaume de Rorthais 2020-04-14 16:09:23 Re: [BUG] non archived WAL removed during production crash recovery
Previous Message Tom Lane 2020-04-14 15:48:52 Re: Buffer overflow when continuously send SIGHUP to postgres

Browse pgsql-hackers by date

  From Date Subject
Next Message Jehan-Guillaume de Rorthais 2020-04-14 16:09:23 Re: [BUG] non archived WAL removed during production crash recovery
Previous Message Tom Lane 2020-04-14 16:03:00 Re: Poll: are people okay with function/operator table redesign?