Re: PITR problem

From: Erik Jones <erik(at)myemma(dot)com>
To: wstrzalka <wstrzalka(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PITR problem
Date: 2008-04-28 17:00:52
Message-ID: 5309E09C-113D-4CEB-9776-D6F01D40C4DE@myemma.com
Lists: pgsql-general


On Apr 26, 2008, at 5:11 PM, wstrzalka wrote:

> I have some problem with setting up PITR recovery on the database.
>
> I have archive_command set properly and logs are shipping OK. Archive
> timeout is also set (5 min).
>
> When performing pg_start_backup the WAL is, let's say, at position
> 0000000100000001000000D9. Then I start copying the database to the
> second machine, which takes me 30 minutes. In that time the archive
> timeout fires a few times and those files are shipped properly to the
> second host. After the DB is successfully copied I call pg_stop_backup.
> The WAL is at that moment at position 0000000100000001000000DE.
>
> In that moment I see on the second machine WAL files from
> 0000000100000001000000D9 to 0000000100000001000000DE as well as
> 0000000100000001000000D9.00000020.backup
>
> The problem occurs now when I'm trying to start my standby server in
> recovery mode (with pg_standby).
>
> The output from pg_standby:
> ------------------------------------
> Trigger file : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file : 00000001.history
> WAL file path : /var/lib/pgsql/incoming_wal/
> 00000001.history
> Restoring to... : pg_xlog/RECOVERYHISTORY
> Sleep interval : 5 seconds
> Max wait interval : 0 forever
> Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/
> 00000001.history" "pg_xlog/RECOVERYHISTORY"
> Keep archive history : 0000000100000001000000DB and later
> running restore : OK
>
>
> Trigger file : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file : 0000000100000001000000D9.00000020.backup
> WAL file path : /var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9.00000020.backup
> Restoring to... : pg_xlog/RECOVERYHISTORY
> Sleep interval : 5 seconds
> Max wait interval : 0 forever
> Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
> Keep archive history : 0000000100000001000000DB and later
> running restore : OK
>
>
> Trigger file : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file : 0000000100000001000000D9
> WAL file path : /var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9
> Restoring to... : pg_xlog/RECOVERYXLOG
> Sleep interval : 5 seconds
> Max wait interval : 0 forever
> Command for restore : ln -s -f "/var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
> Keep archive history : 0000000100000001000000DB and later
> running restore : OK
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"
>
> --------------------------------------------------------------------------------------------------------
>
>
> The first time I start the standby, the Postgres log says the
> following and the postgres process goes down:
> --------------------------------------------------------------------------------------------------------
> restored log file "0000000100000001000000D9.00000020.backup" from
> archive
> could not open file "pg_xlog/0000000100000001000000D9" (log file 1,
> segment 217): No such file or directory
> invalid checkpoint record
> could not locate required checkpoint record
> If you are not restoring from a backup, try removing the file
> "/var/lib/pgsql/data/backup_label".
> startup process (PID 19201) was terminated by signal 6: Aborted
> aborting startup due to startup process failure
> --------------------------------------------------------------------------------------------------------
>
> When I try to start PG for the second time it just gets stuck waiting
> for ...000D9
>
> In my opinion the problem is that when starting, the standby PostgreSQL
> wants to recover WAL 0000000100000001000000D9, but first deletes it,
> as the keep archive history (%r) param is set to
> 0000000100000001000000DB
>
> Is it a bug or am I missing something? I can repeat the scenario with
> this big DB. However, it does not happen in exactly the same
> environment when playing with a smaller cluster (copying the cluster
> takes less time than archive_timeout).

What is the full pg_standby command string (restore_command=....) in
your recovery.conf? It sounds like you have pg_standby set to delete
archived WALs, possibly a little too aggressively. Do you have the -k
flag set in the pg_standby call in your restore_command?
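
To show what I mean by too aggressive, here is a rough Python sketch
(not pg_standby's actual code) of a cleanup rule that would produce the
two "removing" lines in your output. It assumes that any 24-character
WAL segment name sorting before the "Keep archive history" value is
deleted; the function names are made up for illustration.

```python
# Hypothetical model of archive cleanup, NOT pg_standby's real implementation.
# Assumption: plain WAL segments (24 hex characters) whose names sort before
# the "keep" cutoff are removed; other files (.backup, .history) are left alone.

WAL_NAME_LEN = 24  # timeline (8) + log (8) + segment (8) hex digits


def is_wal_segment(name):
    """True if the name looks like a plain WAL segment file."""
    return len(name) == WAL_NAME_LEN and all(c in "0123456789ABCDEF" for c in name)


def files_to_remove(archive_files, keep_from):
    """Return the segments that sort strictly before the cutoff name."""
    return sorted(f for f in archive_files if is_wal_segment(f) and f < keep_from)


# Archive contents as described in the report: segments D9 through DE
# plus the backup history file.
archive = ["0000000100000001000000%02X" % seg for seg in range(0xD9, 0xDF)]
archive.append("0000000100000001000000D9.00000020.backup")

removed = files_to_remove(archive, "0000000100000001000000DB")
print(removed)
```

If that is roughly what is happening, D9 is removed even though the
startup process still needs it, which would match the "could not open
file" error in your log.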

Erik Jones

DBA | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
