Re: spurious(?) warnings in archive recovery

From: Vik Fearing <vik(dot)fearing(at)2ndquadrant(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: spurious(?) warnings in archive recovery
Date: 2018-11-18 23:57:00
Message-ID: 153eb917-df68-ec4f-c4ee-51c0c8f45608@2ndquadrant.com
Lists: pgsql-hackers

On 13/11/2018 16:34, Andrew Gierth wrote:
> So while investigating a case of this warning (in
> UpdateMinRecoveryPoint):
>
> "xlog min recovery request %X/%X is past current point %X/%X"
>
> I noticed that it is issued even in cases where we know that
> minRecoveryPoint is not yet valid, for example because we're waiting to
> see XLOG_BACKUP_END before declaring consistency.
>
> But, you'd think, you shouldn't get this error because any page we
> modify during recovery should have been restored from an FPI with a
> suitably early LSN? For data pages that is correct, but not for VM or
> (iff wal_log_hints or checksums are enabled) FSM pages.
>
> When we replay an operation that, for example, clears a bit in the VM,
> the redo code will read in that VM page from disk, and because we're not
> yet consistent and because _clearing_ a VM bit is not in itself
> wal-logged and doesn't result in any FPI being generated for the VM
> page, it could well read a VM page that has a far-future LSN from the
> point of view of replay, and dirty it, causing a later eviction to try
> and do UpdateMinRecoveryPoint with that future LSN.
>
> (I haven't investigated this aspect, but there also appears to be no
> protection against torn pages in the VM when checksums are enabled. Am I
> missing something somewhere?)
>
> I'm less clear on the exact mechanisms, but when wal_log_hints (or
> checksums) is on, FSM pages also get LSNs, sometimes, thanks to
> MarkBufferDirtyHint, and at least some code paths can also do
> MarkBufferDirty on FSM pages during recovery, which would cause their
> eviction with possible future LSNs as with VM pages.
>
> This means that if you simply do an old-style base backup using
> pg_start_backup/rsync/pg_stop_backup (on a sufficiently active system
> and taking long enough) and then recover from it, you're likely to get a
> log spammed with these errors for no very good reason.
>
> So it seems to me that issuing this error is a bug if the conditions
> described are actually harmless, while if they're not harmless, then
> obviously that is a bug. So _something_ needs fixing here, but I'm not
> yet sufficiently confident of my analysis to say what.
>
> Opinions?
>
> (as a further point, it seems to me that backupEndRequired is a rather
> misleadingly named variable, since what _actually_ determines whether an
> XLOG_BACKUP_END record is expected is whether backupStartPoint is set.
> backupEndRequired seems to change one error message and, questionably,
> one decision about whether to do crash recovery before entering archive
> recovery, but nothing else.)

Bump.

I was the originator of this report. I work with Postgres every single
day and I was spooked by these warnings. People with much less
involvement would probably be terrified.
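
For anyone skimming the thread, here is a rough toy model of the
mechanism Andrew describes above. It is a sketch under stated
assumptions, not PostgreSQL source: the type ToyRecoveryState, the
function toy_update_min_recovery_point and the suppress_when_pending
flag are all hypothetical names made up for illustration. The flag is
only there to show the question being asked (should the warning fire
while we are still waiting for XLOG_BACKUP_END?), not to propose a fix.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for a WAL position; not the real XLogRecPtr handling. */
typedef uint64_t Lsn;

/* Hypothetical, heavily simplified recovery state. */
typedef struct
{
    Lsn  replayed_upto;       /* how far WAL replay has progressed */
    Lsn  min_recovery_point;  /* toy counterpart of minRecoveryPoint */
    bool backup_end_pending;  /* still waiting to see XLOG_BACKUP_END */
} ToyRecoveryState;

/*
 * Called when a dirty page with LSN page_lsn is about to be evicted.
 * Returns true if the toy model would have logged the warning.
 *
 * suppress_when_pending is an assumption for illustration only: when
 * set, the warning is skipped while consistency has not been reached.
 */
static bool
toy_update_min_recovery_point(ToyRecoveryState *st, Lsn page_lsn,
                              bool suppress_when_pending)
{
    bool warned = false;

    /* Nothing to do if the recorded point already covers this page. */
    if (st->min_recovery_point >= page_lsn)
        return false;

    /*
     * A VM or FSM page copied late in the base backup can carry an LSN
     * from the "future" relative to the current replay position.
     */
    if (page_lsn > st->replayed_upto)
    {
        if (!(suppress_when_pending && st->backup_end_pending))
        {
            fprintf(stderr,
                    "toy: min recovery request %" PRIu64
                    " is past current point %" PRIu64 "\n",
                    page_lsn, st->replayed_upto);
            warned = true;
        }
    }

    /* The point can only advance as far as replay has actually got. */
    if (st->min_recovery_point < st->replayed_upto)
        st->min_recovery_point = st->replayed_upto;

    return warned;
}
```

In this model, evicting a page stamped with LSN 500 while replay is only
at 100 triggers the message whether or not consistency has been reached,
unless the hypothetical flag is set. That is the case that fills the log
after an rsync-style base backup.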
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
