Re: Hot Backup with rsync fails at pg_clog if under load

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Chris Redekop <chris(at)replicon(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Backup with rsync fails at pg_clog if under load
Date: 2011-10-23 20:48:06
Message-ID: CAAZKuFbUSxOb8x80k9-fD9CsXBRmzTNWutqcDVGaTvwFTvURiQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 17, 2011 at 11:30 PM, Chris Redekop <chris(at)replicon(dot)com> wrote:
> Well, on the other hand maybe there is something wrong with the data.
>  Here's the test/steps I just did -
> 1. I do the pg_basebackup when the master is under load, hot slave now will
> not start up but warm slave will.
> 2. I start a warm slave and let it catch up to current
> 3. On the slave I change 'hot_standby=on' and do a 'service postgresql
> restart'
> 4. The postgres fails to restart with the same error.
> 5. I turn hot_standby back off and postgres starts back up fine as a warm
> slave
> 6. I then turn off the load, the slave is all caught up, master and slave
> are both sitting idle
> 7. I, again, change 'hot_standby=on' and do a service restart
> 8. Again it fails, with the same error, even though there is no longer any
> load.
> 9. I repeat this warmstart/hotstart cycle a couple more times until to my
> surprise, instead of failing, it successfully starts up as a hot standby
> (this is after maybe 5 minutes or so of sitting idle)
> So...given that it continued to fail even after the load had been turned off,
> that makes me believe that the data which was copied over was invalid in
> some way.  And when a checkpoint/logrotation/somethingelse occurred when not
> under load it cleared itself up....I'm shooting in the dark here
> Anyone have any suggestions/ideas/things to try?

Having dug into this a little -- but not too much -- the problem
seems to be that postgres is reading the commit logs way, way too
early, that is to say, before it has replayed enough WAL to be
'consistent' (the WAL between pg_start_backup and pg_stop_backup). I
have not been able to reproduce this problem (I think) after the
message from postgres suggesting it has reached a consistent state; at
that time I am able to go into hot-standby mode.

The message is like: "consistent recovery state reached at %X/%X".
(this is the errmsg)
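
As a rough aid for anyone trying to reproduce this, here is a minimal
Python sketch that scans server log lines for that message before
hot_standby is switched on. The function name find_consistent_lsn is
hypothetical, and the sketch assumes the log text contains the errmsg
exactly as quoted above; it is not part of postgres itself.

```python
import re

# Matches the errmsg quoted above:
#   "consistent recovery state reached at %X/%X"
# Both halves of the WAL location are printed in hexadecimal.
CONSISTENT_RE = re.compile(
    r"consistent recovery state reached at ([0-9A-Fa-f]+)/([0-9A-Fa-f]+)"
)


def find_consistent_lsn(log_lines):
    """Return the WAL location from the first matching line as a
    (high, low) tuple of ints, or None if the message never appears.

    Hypothetical helper for illustration only.
    """
    for line in log_lines:
        match = CONSISTENT_RE.search(line)
        if match:
            return int(match.group(1), 16), int(match.group(2), 16)
    return None
```

If this returns None for the standby's log, the standby has not yet
reported consistency, which is the window in which I see the failure.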

It doesn't seem meaningful for StartupCLOG (or, indeed, any of the
hot-standby path functionality) to be called before that point is
reached, but right now it is anyway. I'm not sure whether this is
simply an oversight or indicative of a misplaced assumption
somewhere. Basically, my thought for a fix is to suppress
hot_standby = on (in spirit) until the consistent recovery state is
reached.
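
To illustrate what "in spirit" means here, a toy Python model of the
proposed gate follows. It is only a sketch of the idea, not postgres
source and not a patch: the names parse_lsn and hot_standby_allowed are
hypothetical, and it assumes consistency can be judged by comparing the
replayed WAL location against the location at which consistency is
reached.

```python
def parse_lsn(text):
    """Parse a WAL location written as 'X/X' (two hex numbers)
    into a (high, low) tuple of ints so locations compare in order."""
    high, low = text.split("/")
    return int(high, 16), int(low, 16)


def hot_standby_allowed(requested, replayed_lsn, consistent_lsn):
    """Toy model of the proposed behaviour (hypothetical, not postgres code).

    Even when hot_standby = on is requested, treat it as off until
    replay has reached the consistent recovery point.
    """
    if not requested:
        return False
    # Tuples compare element by element, so the high half is checked first.
    return parse_lsn(replayed_lsn) >= parse_lsn(consistent_lsn)
```

In this model, none of the hot-standby startup work (StartupCLOG
included) would run while hot_standby_allowed() is false.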

--
fdr
