Re: "using previous checkpoint record at" maybe not the greatest idea?

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "using previous checkpoint record at" maybe not the greatest idea?
Date: 2016-02-04 23:09:49
Message-ID: CAKFQuwasfkwfhXB37hvjWK1G=cv8Aogun3tDCYEj9FbPNZZ8wQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 4, 2016 at 3:57 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:

> David G. Johnston wrote:
>
> > Learning by reading here...
> >
> > http://www.postgresql.org/docs/current/static/wal-internals.html
> > """
> > After a checkpoint has been made and the log flushed, the checkpoint's
> > position is saved in the file pg_control. Therefore, at the start of
> > recovery, the server first reads pg_control and then the checkpoint
> > record; then it performs the REDO operation by scanning forward from the
> > log position indicated in the checkpoint record. Because the entire
> > content of data pages is saved in the log on the first page modification
> > after a checkpoint (assuming full_page_writes is not disabled), all pages
> > changed since the checkpoint will be restored to a consistent state.
> >
> > To deal with the case where pg_control is corrupt, we should support the
> > possibility of scanning existing log segments in reverse order — newest
> > to oldest — in order to find the latest checkpoint. This has not been
> > implemented yet. pg_control is small enough (less than one disk page)
> > that it is not subject to partial-write problems, and as of this writing
> > there have been no reports of database failures due solely to the
> > inability to read pg_control itself. So while it is theoretically a weak
> > spot, pg_control does not seem to be a problem in practice.
> > """
> >
> > The above comment appears out-of-date if this post describes what
> > presently happens.
>
> I think you're misinterpreting Andres, or the docs, or both.
>
> What Andres says is that the control file (pg_control) stores two
> checkpoint locations: the latest one, and the one before that. When
> recovery occurs, it starts by looking up the latest checkpoint record;
> if it cannot find that for whatever reason, it falls back to reading the
> previous one. (He further claims that falling back to the previous one
> is a bad idea.)
>
> What the 2nd para in the documentation is saying is something different:
> it is talking about reading all the pg_xlog files (in reverse order),
> which is not pg_control, and see what checkpoint records are there, then
> figure out which one to use.
>

Yes, I inferred something that obviously isn't true: that the system never
goes hunting for a valid checkpoint to begin recovery from. It does not do
so when the pg_control file is corrupted, and I further assumed it never
did. That is because the documentation never states that two checkpoint
positions exist and that PostgreSQL will try the second one if the first
one proves unusable. Given the topic of this thread, that omission makes
the documentation out-of-date. Maybe it's covered elsewhere, but since this
section addresses locating a starting point, I would expect any such
description to be here as well.
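
To make the distinction concrete, here is a toy model of the two
strategies as I now understand them. It is only a sketch, not PostgreSQL
code: the ControlFile class, the dict standing in for the WAL, and every
function name are made up for illustration. choose_checkpoint() mirrors
the behaviour Alvaro describes (try the latest location stored in
pg_control, then the prior one), while scan_for_latest_checkpoint() is the
reverse scan that the documentation says has not been implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ControlFile:
    """Hypothetical stand-in for pg_control: it stores two checkpoint locations."""
    latest_checkpoint: int
    prior_checkpoint: int


def read_checkpoint_record(wal: Dict[int, str], location: int) -> Optional[str]:
    """Return the record at `location` if it is a readable checkpoint record."""
    record = wal.get(location)
    if record is not None and record.startswith("CHECKPOINT"):
        return record
    return None


def choose_checkpoint(control: ControlFile, wal: Dict[int, str]) -> Tuple[int, str]:
    """Try the latest checkpoint first, then fall back to the prior one."""
    candidates = (
        ("latest", control.latest_checkpoint),
        ("prior", control.prior_checkpoint),
    )
    for which, location in candidates:
        if read_checkpoint_record(wal, location) is not None:
            return location, which
    raise RuntimeError("could not locate a valid checkpoint record")


def scan_for_latest_checkpoint(wal: Dict[int, str]) -> Optional[int]:
    """Reverse scan (newest to oldest) that ignores the control file entirely.

    This models the idea the documentation describes as not implemented.
    """
    for location in sorted(wal, reverse=True):
        if read_checkpoint_record(wal, location) is not None:
            return location
    return None


# Toy WAL keyed by location; the record at 200 is unreadable.
wal = {
    100: "CHECKPOINT redo=90",
    150: "INSERT",
    200: "GARBAGE",
    250: "UPDATE",
}
control = ControlFile(latest_checkpoint=200, prior_checkpoint=100)

print(choose_checkpoint(control, wal))   # falls back to the prior checkpoint
print(scan_for_latest_checkpoint(wal))   # finds the same record without pg_control
```

In this toy case both strategies land on the record at 100, but only the
first one depends on pg_control being readable, which is the point the
documentation paragraph is making.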

David J.
