Re: Online base backup from the hot-standby

From: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Jun Ishiduka <ishizuka(dot)jun(at)po(dot)ntts(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, cedric(dot)villemain(dot)debian(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, magnus(at)hagander(dot)net, heikki(dot)linnakangas(at)enterprisedb(dot)com
Subject: Re: Online base backup from the hot-standby
Date: 2011-09-26 02:39:00
Message-ID: BLU0-SMTP598D7314D4D1468AA0C4238EF30@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11-09-22 09:24 AM, Fujii Masao wrote:
> On Wed, Sep 21, 2011 at 11:50 AM, Fujii Masao<masao(dot)fujii(at)gmail(dot)com> wrote:
>> 2011/9/13 Jun Ishiduka<ishizuka(dot)jun(at)po(dot)ntts(dot)co(dot)jp>:
>>> Update patch.
>>>
>>> Changes:
>>> * set 'on' full_page_writes by user (in document)
>>> * read "FROM: XX" in backup_label (in xlog.c)
>>> * check status when pg_stop_backup is executed (in xlog.c)
>> Thanks for updating the patch.
>>
>> Before reviewing the patch, to encourage people to comment and
>> review the patch, I explain what this patch provides:
> Attached is the updated version of the patch. I refactored the code, fixed
> some bugs, added lots of source code comments, improved the document,
> but didn't change the basic design. Please check this patch, and let's use
> this patch as the base if you agree with that.
>

I have looked at both Jun's patch from Sept 13 and Fujii's updates to
the patch. I agree that Fujii's updated version should be used as the
basis for changes going forward. My comments below refer to that
version (unless otherwise noted).

In backup.sgml the new section titled "Making a Base Backup during
Recovery" I would prefer to see some mention in the title that this
procedure is for standby servers ie "Making a Base Backup from a Standby
Database". Users who have setup a hot-standby database should be
familiar with the 'standby' terminology. I agree that the "during
recovery" description is technically correct but I'm not sure someone
who is looking through the manual for instructions on making a base
backup from here standby will realize this is the section they should read.

Around line 969 where you give an example of copying the control file I
would be a bit clearer that this is an example command. Ie (Copy the
pg_control file from the cluster directory to the global sub-directory
of the backup. For example "cp $PGDATA/global/pg_control
/mnt/server/backupdir/global")

Testing Notes
-----------------------------

I created a standby server from a base backup of another standby server.
On this new standby server I then

1. Ran pg_start_backup('3'); and left the psql connection open
2. touch /tmp/3 -- my trigger_file

ssinger(at)ssinger-laptop:/usr/local/pgsql92git/bin$ LOG: trigger file
found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000006" from archive
LOG: record with zero length at 0/60002F0
LOG: restored log file "000000010000000000000006" from archive
LOG: redo done at 0/6000298
LOG: restored log file "000000010000000000000006" from archive
PANIC: record with zero length at 0/6000298
LOG: startup process (PID 19011) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.

The new postmaster (the one trying to be promoted) dies. This is
somewhat repeatable.

----

If a base backup is in progress on a recovery database and that recovery
database is promoted to master, following the promotion (if you don't
restart the postmaster). I see
select pg_stop_backup();
ERROR: database system status mismatches between pg_start_backup() and
pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster
leaves recovery mode I think it should abort an existing base backup so
pg_stop_backup() will say no backup in progress, or give an error
message on pg_stop_backup() saying that the base backup won't be
usable. The above error doesn't really tell the user why there is a
mismatch.

---------

In my testing a few times I got into a situation where a standby server
coming from a recovery target took a while to finish recovery (this is
on a database with no activity). Then when i tried promoting that
server to master I got

LOG: trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000009" from archive
LOG: restored log file "000000010000000000000009" from archive
LOG: redo done at 0/90000E8
LOG: restored log file "000000010000000000000009" from archive
PANIC: unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG: startup process (PID 1804) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my
testing isn't script based. A standby server that 'looks' okay but can't
actually be promoted is dangerous.

This version of the patch (I was testing the Sept 22nd version) seems
less stable than how I remember the version from the July CF. Maybe I'm
just testing it harder or maybe something has been broken.

> In the current patch, there is no safeguard for preventing users from
> taking backup during recovery when FPW is disabled. This is unsafe.
> Are you planning to implement such a safeguard?
>

I agree with Fujii that we need a way (on the recovery machine) to
detect if the master doesn't have FPW on. The ideas up-thread on how to
do this sound good.

> Regards,
>
>
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-09-26 03:22:56 Re: [v9.2] Fix Leaky View Problem
Previous Message Noah Misch 2011-09-26 02:38:18 Re: [v9.2] Fix Leaky View Problem