Re: "using previous checkpoint record at" maybe not the greatest idea?

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "using previous checkpoint record at" maybe not the greatest idea?
Date: 2016-02-02 00:29:39
Message-ID: CAKFQuwa+YTwiPXpqMy3gUseE=v9JmEV0GAF9ToikH6-Ns-rKtQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 1, 2016 at 4:58 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.

> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>
> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?
>

​Learning by reading here...

http://www.postgresql.org/docs/current/static/wal-internals.html
"""
​After a checkpoint has been made and the log flushed, the checkpoint's
position is saved in the file pg_control. Therefore, at the start of
recovery, the server first reads pg_control and then the checkpoint record;
then it performs the REDO operation by scanning forward from the log
position indicated in the checkpoint record. Because the entire content of
data pages is saved in the log on the first page modification after a
checkpoint (assuming full_page_writes is not disabled), all pages changed
since the checkpoint will be restored to a consistent state.

To deal with the case where pg_control is corrupt, we should support the
possibility of scanning existing log segments in reverse order — newest to
oldest — in order to find the latest checkpoint. This has not been
implemented yet. pg_control is small enough (less than one disk page) that
it is not subject to partial-write problems, and as of this writing there
have been no reports of database failures due solely to the inability to
read pg_control itself. So while it is theoretically a weak spot,
pg_control does not seem to be a problem in practice.
​"""​

​The above comment appears out-of-date if this post describes what
presently happens.

Also, I was​ under the impression that tablespace commands resulted in
checkpoints so that the state of the file system could be presumed
current...

I don't know enough internals but its seems like we'd need to distinguish
between an interrupted checkpoint (pull the plug during checkpoint) and one
that supposedly completed without interruption but then was somehow
corrupted (solar flares). The former seem legitimate for auto-skip while
the later do not.

David J.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-02-02 00:43:34 checkpoints after database start/immediate checkpoints
Previous Message David Steele 2016-02-02 00:24:48 Re: [PROPOSAL] Client Log Output Filtering