hot standby startup, visibility map, clog

From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: hot standby startup, visibility map, clog
Date: 2011-06-09 09:14:01
Message-ID: BANLkTinCXfATbbPdXpz1OMW9A1Terg9hLQ@mail.gmail.com
Lists: pgsql-hackers

Hello list,

A little while ago I posted about how my ... exciting ....
backup procedure caused occasional startup problems due to clog not
being big enough.
(http://archives.postgresql.org/pgsql-hackers/2011-04/msg01148.php) I
recently had a reproduction and a little bit of luck, and I think I
have a slightly better idea of what may be causing this.

The first fact is that turning off hot standby will let the cluster
start up, but only after seeing a spate of messages like these (a dozen
or dozens, not thousands):

2011-06-09 08:02:32 UTC LOG: restored log file "000000020000002C000000C0" from archive
2011-06-09 08:02:33 UTC WARNING: xlog min recovery request 2C/C1F09658 is past current point 2C/C037B278
2011-06-09 08:02:33 UTC CONTEXT: writing block 0 of relation base/16385/16784_vm
xlog redo insert: rel 1663/16385/128029; tid 114321/63
2011-06-09 08:02:33 UTC LOG: restartpoint starting: xlog
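
To make the WARNING easier to read, here is a rough sketch (plain Python,
not PostgreSQL code) that compares the two LSNs and maps each to a WAL
segment file name. It assumes the pre-9.3 "xlogid/xrecoff" LSN format and
the default 16 MB segment size; the helper names parse_lsn, is_past and
wal_file_name are made up for illustration.

```python
import re

LSN_RE = re.compile(r"^([0-9A-Fa-f]+)/([0-9A-Fa-f]+)$")

# Assumption: default 16 MB WAL segments (the build-time default).
SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Split an LSN printed as 'XLOGID/XRECOFF' (hex) into two ints."""
    m = LSN_RE.match(text.strip())
    if m is None:
        raise ValueError("not an LSN: %r" % (text,))
    return int(m.group(1), 16), int(m.group(2), 16)


def is_past(request, current):
    """True if 'request' lies beyond 'current' in the WAL stream."""
    # Tuples compare xlogid first, then xrecoff.
    return parse_lsn(request) > parse_lsn(current)


def wal_file_name(timeline, lsn):
    """Name of the segment file holding 'lsn' on the given timeline."""
    xlogid, xrecoff = parse_lsn(lsn)
    return "%08X%08X%08X" % (timeline, xlogid, xrecoff // SEGMENT_SIZE)


if __name__ == "__main__":
    request, current = "2C/C1F09658", "2C/C037B278"
    print(is_past(request, current))
    print(wal_file_name(2, current))
    print(wal_file_name(2, request))
```

With the values from the log above (timeline 2 taken from the restored
file name), the current point falls in segment 000000020000002C000000C0,
the file just restored, while the requested LSN falls in the next
segment, 000000020000002C000000C1.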

Most importantly, *all* such messages are in visibility map forks
(_vm). I am reasonably confident that my code does not start reading
data until pg_start_backup() has returned, and blocks on
pg_stop_backup() after having read all the data. Also, the mailing
list correspondence at
http://archives.postgresql.org/pgsql-hackers/2010-11/msg02034.php
suggests that the visibility map is not flushed at checkpoints, so
perhaps with some poor timing an old page can wander onto disk even
after a checkpoint barrier that pg_start_backup() waits for. (I have not
yet found the critical section that makes visibility map buffers immune
to checkpoints, though.)
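
The "all such messages are in _vm forks" observation can be checked
mechanically. The following is a minimal sketch, assuming a log that looks
like the excerpt above (a WARNING followed shortly by its CONTEXT line);
the function names and the 80-character gap allowed between the two lines
are arbitrary choices for illustration, not anything PostgreSQL defines.

```python
import re

# Pair each "min recovery request" WARNING with the CONTEXT that follows it.
# Assumption: the CONTEXT line appears within ~80 characters of the WARNING
# once whitespace is collapsed (i.e. only a log-line prefix sits between).
WARNING_RE = re.compile(
    r"xlog min recovery request (\S+) is past current point (\S+)"
    r".{0,80}?CONTEXT: writing block (\d+) of relation (\S+)"
)


def fork_of(relpath):
    """Classify a relation file path by its fork suffix."""
    name = relpath.rsplit("/", 1)[-1]
    if name.endswith("_vm"):
        return "vm"
    if name.endswith("_fsm"):
        return "fsm"
    return "main"


def min_recovery_warnings(log_text):
    """Return one dict per WARNING/CONTEXT pair found in the log text."""
    # Collapse all whitespace so mail-wrapped log lines still match.
    flat = " ".join(log_text.split())
    return [
        {"request": req, "current": cur, "block": int(blk),
         "relation": rel, "fork": fork_of(rel)}
        for req, cur, blk, rel in WARNING_RE.findall(flat)
    ]


SAMPLE = """
2011-06-09 08:02:33 UTC WARNING: xlog min recovery request
2C/C1F09658 is past current point 2C/C037B278
2011-06-09 08:02:33 UTC CONTEXT: writing block 0 of relation
base/16385/16784_vm
xlog redo insert: rel 1663/16385/128029; tid 114321/63
"""

if __name__ == "__main__":
    hits = min_recovery_warnings(SAMPLE)
    for h in hits:
        print(h["relation"], h["fork"])
    print(all(h["fork"] == "vm" for h in hits))
```

Run over a full recovery log, any hit whose fork is not "vm" would
contradict the observation above.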

Given all that, suppose the smgr's generic read path, which checks the
LSN and possibly the clog (but apparently only in hot standby mode, since
pre-hot-standby the clog's intermediate states were not so
interesting...), has a problem with such uncheckpointed pages. Then it
would seem reasonable that the system now refuses to start, where it
once started fine.

FWIW, letting recovery run without hot standby for a little while,
canceling, and then starting again after the danger zone had passed
would allow recovery to proceed correctly, as one might expect.

Thoughts?

--
fdr
