Re: Online base backup from the hot-standby

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc: Jun Ishiduka <ishizuka(dot)jun(at)po(dot)ntts(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, cedric(dot)villemain(dot)debian(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, magnus(at)hagander(dot)net, heikki(dot)linnakangas(at)enterprisedb(dot)com
Subject: Re: Online base backup from the hot-standby
Date: 2011-09-27 02:56:25
Message-ID: CAHGQGwHQdzoAX1Op0MBiJcSRNAQLxS6L_5Q7giueMBWg4LbJEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 11:39 AM, Steve Singer <ssinger_pg(at)sympatico(dot)ca> wrote:
> I have looked at both Jun's patch from Sept 13 and Fujii's updates to the
> patch.  I agree that Fujii's updated version should be used as the basis for
> changes going forward.   My comments below refer to that version (unless
> otherwise noted).

Thanks for the tests and comments!

> In backup.sgml  the new section titled "Making a Base Backup during
> Recovery"  I would prefer to see some mention in the title that this
> procedure is for standby servers ie "Making a Base Backup from a Standby
> Database".  Users who have setup a hot-standby database should be familiar
> with the 'standby' terminology. I agree that the "during recovery"
> description is technically correct but I'm not sure someone who is looking
> through the manual for instructions on making a base backup from here
> standby will realize this is the section they should read.

I used the term "recovery" rather than "standby" because we can take
a backup even from the server in normal archive recovery mode but not
standby mode. But there is not many users who take a backup during
normal archive recovery, so I agree that the term "standby" is better to
be used in the document. Will change.

> Around line 969 where you give an example of copying the control file I
> would be a bit clearer that this is an example command.  Ie (Copy the
> pg_control file from the cluster directory to the global sub-directory of
> the backup.  For example "cp $PGDATA/global/pg_control
> /mnt/server/backupdir/global")

Looks better. Will change.

> Testing Notes
> -----------------------------
>
> I created a standby server from a base backup of another standby server. On
> this new standby server I then
>
> 1. Ran pg_start_backup('3'); and left the psql connection open
> 2. touch /tmp/3 -- my trigger_file
>
> ssinger(at)ssinger-laptop:/usr/local/pgsql92git/bin$ LOG:  trigger file found:
> /tmp/3
> FATAL:  terminating walreceiver process due to administrator command
> LOG:  restored log file "000000010000000000000006" from archive
> LOG:  record with zero length at 0/60002F0
> LOG:  restored log file "000000010000000000000006" from archive
> LOG:  redo done at 0/6000298
> LOG:  restored log file "000000010000000000000006" from archive
> PANIC:  record with zero length at 0/6000298
> LOG:  startup process (PID 19011) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
>
> The new postmaster (the one trying to be promoted) dies.  This is somewhat
> repeatable.

Looks weired. Though the WAL record starting from 0/6000298 was read
successfully, then re-fetch of the same record fails at the end of recovery.
One possible cause is the corruption of archived WAL file. What
restore_command on the standby and archive_command on the master
are you using? Could you confirm that there is no chance to overwrite
archive WAL files in your environment?

I tried to reproduce this problem several times, but I could not. Could
you provide the test case which reproduces the problem?

> If a base backup is in progress on a recovery database and that recovery
> database is promoted to master, following the promotion (if you don't
> restart the postmaster).  I see
> select pg_stop_backup();
> ERROR:  database system status mismatches between pg_start_backup() and
> pg_stop_backup()
>
> If you restart the postmaster this goes away.  When the postmaster leaves
> recovery mode I think it should abort an existing base backup so
> pg_stop_backup() will say no backup in progress,

I don't think that it's good idea to cancel the backup when promoting
the standby.
Because if we do so, we need to handle correctly the case where cancel of backup
and pg_start_backup/pg_stop_backup are performed at the same time. We can
simply do that by protecting those whole operations including pg_start_backup's
checkpoint by the lwlock. But I don't think that it's worth
introducing new lwlock
only for that. And it's not good to take a lwlock through
time-consuming checkpoint
operation. Of course we can avoid such a lwlock, but which would require more
complicated code.

> or give an error message on
> pg_stop_backup() saying that the base backup won't be usable.  The above
> error doesn't really tell the user why there is a mismatch.

What about the following error message?

ERROR: pg_stop_backup() was executed during normal processing though
pg_start_backup() was executed during recovery
HINT: The database backup will not be usable.

Or, you have better idea?

> In my testing a few times I got into a situation where a standby server
> coming from a recovery target took a while to finish recovery (this is on a
> database with no activity).  Then when i tried promoting that server to
> master I got
>
> LOG:  trigger file found: /tmp/3
> FATAL:  terminating walreceiver process due to administrator command
> LOG:  restored log file "000000010000000000000009" from archive
> LOG:  restored log file "000000010000000000000009" from archive
> LOG:  redo done at 0/90000E8
> LOG:  restored log file "000000010000000000000009" from archive
> PANIC:  unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
> LOG:  startup process (PID 1804) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
>
> It is *possible* I mixed up the order of a step somewhere since my testing
> isn't script based. A standby server that 'looks' okay but can't actually be
> promoted is dangerous.

Looks the same problem as the above. Another weired point is that
the same archived WAL file is restored two times before redo is done.
I'm not sure why this happens... Could you provide the test case which
reproduces this problem? Will diagnose.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2011-09-27 03:29:00 Re: random isolation test failures
Previous Message Bruce Momjian 2011-09-27 02:09:44 Re: pg_upgrade automatic testing