Skip site navigation (1) Skip section navigation (2)

Re: Online base backup from the hot-standby

From: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Jun Ishiduka <ishizuka(dot)jun(at)po(dot)ntts(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, cedric(dot)villemain(dot)debian(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, magnus(at)hagander(dot)net, heikki(dot)linnakangas(at)enterprisedb(dot)com
Subject: Re: Online base backup from the hot-standby
Date: 2011-09-27 23:10:39
Message-ID: BLU0-SMTP5C178B39ECE60A64B08158EF00@phx.gbl (view raw or flat)
Thread:
Lists: pgsql-hackers
On 11-09-26 10:56 PM, Fujii Masao wrote:
>
> Looks weired. Though the WAL record starting from 0/6000298 was read
> successfully, then re-fetch of the same record fails at the end of recovery.
> One possible cause is the corruption of archived WAL file. What
> restore_command on the standby and archive_command on the master
> are you using? Could you confirm that there is no chance to overwrite
> archive WAL files in your environment?
>
> I tried to reproduce this problem several times, but I could not. Could
> you provide the test case which reproduces the problem?
>

This is the test procedure I'm trying today, I wasn't able to reproduce 
the crash.  What I was doing the other day was similar but I can't speak 
to unintentional differences.


I have my master server
data
port=5439
wal_level=hot_standby
archive_mode=on
archive_command='cp -i %p /usr/local/pgsql92git/archive/%f'
hot_standby=on

I then run
select pg_start_backup('foo');
$ rm -r ../data2
$ cp -r ../data ../data2
$ rm ../data2/postmaster.pid
select pg_stop_backup();
I edit data2/postgresql.conf so
port=5438
I commented out archive_mode and archive_command (or at least today I did)
recovery.conf is

standby_mode='on'
primary_conninfo='host=127.0.0.1 port=5439 user=ssinger dbname=test'
restore_command='cp /usr/local/pgsql92git/archive/%f %p'

I then start up the second cluster. On it I run

select pg_start_backup('1');
$ rm -r ../data3
$ rm -r ../archive2
$ cp -r ../data2 ../data3
$ cp ../data2/global/pg_control ../data3/global

select pg_stop_backup();
I edit ../data2/postgresql.conf
port=5437
archive_mode=on
                                 # (change requires restart)
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'

recovery.conf is

standby_mode='on'
primary_conninfo='host=127.0.0.1 port=5439 user=ssinger dbname=test'
restore_command='cp /usr/local/pgsql92git/archive/%f %p'
trigger_file='/tmp/3'

$ postgres -D ../data3

The first time I did this postgres came up quickly.

$ touch /tmp/3

worked fine.

I then stopped data3
$ rm -r ../data3
on data 2 I run
pg_start_backup('1')
$ cp -r ../data2 ../data3
$ cp ../data2/global/pg_control ../data3/global
select pg_stop_backup() # on data2
$ rm ../data3/postmaster.pid
vi ../data3/postgresql.conf # same changes as above for data3
vi ../data3/recovery.conf # same as above for data 3
postgres -D ../data3

This time I got
./postgres -D ../data3
LOG:  database system was interrupted while in recovery at log time 
2011-09-27 22:04:17 GMT
HINT:  If this has occurred more than once some data might be corrupted 
and you might need to choose an earlier recovery target.
LOG:  entering standby mode
cp: cannot stat 
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file 
or directory
LOG:  redo starts at 0/C000020
LOG:  record with incorrect prev-link 0/9000058 at 0/C0000B0
cp: cannot stat 
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file 
or directory
LOG:  streaming replication successfully connected to primary
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  consistent recovery state reached at 0/C0000E8
LOG:  database system is ready to accept read only connections

In order to get the database to come in read only mode I manually issued 
a checkpoint on the master (data) shortly after the checkpoint command 
the data3 instance went to read only mode.

then

touch /tmp/3

trigger file found: /tmp/3
FATAL:  terminating walreceiver process due to administrator command
cp: cannot stat 
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file 
or directory
LOG:  record with incorrect prev-link 0/9000298 at 0/C0002F0
cp: cannot stat 
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file 
or directory
LOG:  redo done at 0/C000298
cp: cannot stat 
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file 
or directory
cp: cannot stat `/usr/local/pgsql92git/archive/00000002.history': No 
such file or directory
LOG:  selected new timeline ID: 2
cp: cannot stat `/usr/local/pgsql92git/archive/00000001.history': No 
such file or directory
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started


It looks like data3 is still pulling files with the recovery command 
after it sees the touch file (is this expected behaviour?)
$ grep archive ../data3/postgresql.conf
#wal_level = minimal            # minimal, archive, or hot_standby
#archive_mode = off        # allows archiving to be done
archive_mode=on
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'


I have NOT been able to make postgres crash during a recovery (today).  
It is *possible* that on some of my runs the other day I had skipped 
changing the archive command on data3 to write to archive2 instead of 
archive.

I have also today not been able to get it to attempt to restore the same 
WAL file twice.


>> If a base backup is in progress on a recovery database and that recovery
>> database is promoted to master, following the promotion (if you don't
>> restart the postmaster).  I see
>> select pg_stop_backup();
>> ERROR:  database system status mismatches between pg_start_backup() and
>> pg_stop_backup()
>>
>> If you restart the postmaster this goes away.  When the postmaster leaves
>> recovery mode I think it should abort an existing base backup so
>> pg_stop_backup() will say no backup in progress,
> I don't think that it's good idea to cancel the backup when promoting
> the standby.
> Because if we do so, we need to handle correctly the case where cancel of backup
> and pg_start_backup/pg_stop_backup are performed at the same time. We can
> simply do that by protecting those whole operations including pg_start_backup's
> checkpoint by the lwlock. But I don't think that it's worth
> introducing new lwlock
> only for that. And it's not good to take a lwlock through
> time-consuming checkpoint
> operation. Of course we can avoid such a lwlock, but which would require more
> complicated code.
>
>> or give an error message on
>> pg_stop_backup() saying that the base backup won't be usable.  The above
>> error doesn't really tell the user why there is a mismatch.
> What about the following error message?
>
> ERROR:  pg_stop_backup() was executed during normal processing though
> pg_start_backup() was executed during recovery
> HINT:  The database backup will not be usable.
>
> Or, you have better idea?

I like that error message better.  It tells me what is going on versus 
complaining about a state mismatch.
>> In my testing a few times I got into a situation where a standby server
>> coming from a recovery target took a while to finish recovery (this is on a
>> database with no activity).  Then when i tried promoting that server to
>> master I got
>>
>> LOG:  trigger file found: /tmp/3
>> FATAL:  terminating walreceiver process due to administrator command
>> LOG:  restored log file "000000010000000000000009" from archive
>> LOG:  restored log file "000000010000000000000009" from archive
>> LOG:  redo done at 0/90000E8
>> LOG:  restored log file "000000010000000000000009" from archive
>> PANIC:  unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
>> LOG:  startup process (PID 1804) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>>
>> It is *possible* I mixed up the order of a step somewhere since my testing
>> isn't script based. A standby server that 'looks' okay but can't actually be
>> promoted is dangerous.
> Looks the same problem as the above. Another weired point is that
> the same archived WAL file is restored two times before redo is done.
> I'm not sure why this happens... Could you provide the test case which
> reproduces this problem? Will diagnose.
>
> Regards,
>


In response to

Responses

pgsql-hackers by date

Next:From: Steve CrawfordDate: 2011-09-27 23:13:41
Subject: pg_upgrade - add config directory setting
Previous:From: Florian PflugDate: 2011-09-27 23:05:51
Subject: Re: [PATCH] Log crashed backend's query v2

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group