Re: Hot Backup with rsync fails at pg_clog if under load

From: Chris Redekop <chris(at)replicon(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Backup with rsync fails at pg_clog if under load
Date: 2011-10-17 21:30:49
Message-ID: CAC2SuRJPNqPe7Ga8LT6Q-vOgn05BmiH=y6F_dfiVu5u3NfOs=w@mail.gmail.com
Lists: pgsql-hackers

Well, on the other hand, maybe there is something wrong with the data.
Here are the steps I just ran:
1. I do the pg_basebackup while the master is under load. A hot slave will
not start up from it, but a warm slave will.
2. I start a warm slave and let it catch up to current.
3. On the slave I change 'hot_standby=on' and do a 'service postgresql
restart'.
4. Postgres fails to restart, with the same error.
5. I turn hot_standby back off and postgres starts back up fine as a warm
slave.
6. I then turn off the load. The slave is all caught up, and master and
slave are both sitting idle.
7. I again change 'hot_standby=on' and do a service restart.
8. Again it fails with the same error, even though there is no longer any
load.
9. I repeat this warm-start/hot-start cycle a couple more times until, to my
surprise, instead of failing, it successfully starts up as a hot standby
(this is after maybe 5 minutes or so of sitting idle).
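
For anyone who wants to repeat the warm/hot toggle in steps 3 through 7
without hand-editing the config each time, here is a minimal sketch of
the edit itself. The helper name set_hot_standby is hypothetical, it
works on the text of postgresql.conf rather than a real file path, and it
assumes the setting lives on a single "hot_standby = ..." line; any
trailing comment on that line is dropped.

```python
import re

# Matches an existing hot_standby line, commented out or not.
# It does not match hot_standby_feedback or "wal_level = hot_standby".
_HOT_STANDBY_LINE = re.compile(
    r"^[ \t]*#?[ \t]*hot_standby[ \t]*=.*$", re.MULTILINE
)


def set_hot_standby(conf_text: str, enabled: bool) -> str:
    """Return conf_text with hot_standby switched on or off.

    Hypothetical helper: rewrites the first hot_standby line it finds,
    or appends one if the setting is absent.
    """
    new_line = "hot_standby = " + ("on" if enabled else "off")
    if _HOT_STANDBY_LINE.search(conf_text):
        return _HOT_STANDBY_LINE.sub(new_line, conf_text, count=1)
    # Setting not present: append it, adding a newline first if needed.
    sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
    return conf_text + sep + new_line + "\n"
```

The rewritten text would still need to be written back to the config file
and followed by the 'service postgresql restart' from step 3.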

So...given that it continued to fail even after the load had been turned off,
I believe the data that was copied over was invalid in some way, and that a
checkpoint, log rotation, or something else that occurred while not under
load cleared it up. I'm shooting in the dark here, though.

Anyone have any suggestions/ideas/things to try?

On Mon, Oct 17, 2011 at 2:13 PM, Chris Redekop <chris(at)replicon(dot)com> wrote:

> I can confirm that both the pg_clog and pg_subtrans errors do occur when
> using pg_basebackup instead of rsync. The data itself seems to be fine
> because using the exact same data I can start up a warm standby no problem,
> it is just the hot standby that will not start up.
>
>
> On Sat, Oct 15, 2011 at 7:33 PM, Chris Redekop <chris(at)replicon(dot)com> wrote:
>
>> > > Linas, could you capture the output of pg_controldata *and* increase
>> > > the log level to DEBUG1 on the standby? We should then see nextXid
>> > > value of the checkpoint the recovery is starting from.
>> >
>> > I'll try to do that whenever I'm in that territory again...
>> > Incidentally, recently there was a lot of unrelated-to-this-post work
>> > to polish things up for a talk being given at PGWest 2011 Today :)
>> >
>> > > I also checked what rsync does when a file vanishes after rsync
>> > > computed the file list, but before it is sent. rsync 3.0.7 on OSX,
>> > > at least, complains loudly, and doesn't sync the file. It BTW also
>> > > exits non-zero, with a special exit code for precisely that failure
>> > > case.
>> >
>> > To be precise, my script has logic to accept the exit code 24, just as
>> > stated in PG manual:
>> >
>> > Docs> For example, some versions of rsync return a separate exit code
>> > Docs> for "vanished source files", and you can write a driver script to
>> > Docs> accept this exit code as a non-error case.
>>
>> I also am running into this issue and can reproduce it very reliably. For
>> me, however, it happens even when doing the "fast backup" like so:
>> pg_start_backup('whatever', true)...my traffic is more write-heavy than
>> linas's tho, so that might have something to do with it. Yesterday it
>> reliably errored out on pg_clog every time, but today it is
>> failing sporadically on pg_subtrans (which seems to be past where the
>> pg_clog error was)....the only thing that has changed is that I've changed
>> the log level to debug1....I wouldn't think that could be related though.
>> I've linked the requested pg_controldata and debug1 logs for both errors.
>> Both links contain the output from pg_start_backup, rsync, pg_stop_backup,
>> pg_controldata, and then the postgres debug1 log produced from a subsequent
>> startup attempt.
>>
>> pg_clog: http://pastebin.com/mTfdcjwH
>> pg_subtrans: http://pastebin.com/qAXEHAQt
>>
>> Any workarounds would be very appreciated.....would copying clog+subtrans
>> before or after the rest of the data directory (or something like that) make
>> any difference?
>>
>> Thanks!
>>
>
>
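
For reference, the driver-script logic mentioned in the quoted docs
passage (accepting rsync exit code 24, "vanished source files", as a
non-error) might look roughly like the sketch below. The function names
rsync_ok and run_rsync are hypothetical, and this is not Linas's actual
script. The rsync arguments are left to the caller.

```python
import subprocess

# rsync exit code for "partial transfer due to vanished source files".
RSYNC_VANISHED = 24


def rsync_ok(returncode: int) -> bool:
    """Treat a clean exit and 'vanished source files' as success."""
    return returncode in (0, RSYNC_VANISHED)


def run_rsync(args):
    """Run rsync with the given argument list (hypothetical wrapper).

    Raises RuntimeError for any exit code other than 0 or 24.
    """
    result = subprocess.run(["rsync", *args])
    if not rsync_ok(result.returncode):
        raise RuntimeError("rsync failed with exit code %d" % result.returncode)
    return result.returncode
```

Note this only silences the exit code for files that disappeared during
the copy. It does nothing about the pg_clog/pg_subtrans startup errors
discussed in this thread.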
