Re: recovery_target_time ignored ?

From: David Steele <david(at)pgmasters(dot)net>
To: Venkata Balaji N <nag1010(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: recovery_target_time ignored ?
Date: 2015-03-24 14:28:38
Message-ID: 55117496.1010501@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/24/15 6:12 AM, Venkata Balaji N wrote:
>
> On Tue, Mar 24, 2015 at 9:54 AM, David Steele <david(at)pgmasters(dot)net
> <mailto:david(at)pgmasters(dot)net>> wrote:
>
> On 3/23/15 12:42 AM, Venkata Balaji N wrote:
> > Hi,
> >
> > Assuming that this might require a patch, i am posting this in
> > pgsql-hackers. Apologies, if this is not the appropriate mailing list to
> > start this discussion.
> >
> > I performed a PITR and saw the below message in the log file is a bit
> > confusing.
> >
> > 2015-03-23 13:49:09.816 GMT-10 DB= PID=4707 LOG: *database system was
> > interrupted; last known up at 2015-03-23 10:30:26 GMT-10*
> > 2015-03-23 13:49:09.817 GMT-10 DB= PID=4707 LOG: *starting
> point-in-time
> > recovery to 2015-03-23 10:00:26+10*
> > 2015-03-23 13:49:09.827 GMT-10 DB= PID=4707 LOG: restored log file
> > "000000010000000B00000020" from archive
> > 2015-03-23 13:49:09.888 GMT-10 DB= PID=4707 LOG: redo starts at B/20000090
> > 2015-03-23 13:49:09.937 GMT-10 DB= PID=4707 LOG: consistent recovery
> > state reached at B/200000B8
> > 2015-03-23 13:49:09.947 GMT-10 DB= PID=4707 LOG: restored log file
> > "000000010000000B00000021" from archive
> > 2015-03-23 13:49:09.950 GMT-10 DB= PID=4707 LOG: *recovery stopping
> > before commit of transaction 16267, time 2015-03-23 13:22:37.53007+10*
> >
> >
> > By mistake i gave "recovery_target_time" as 10:00 GMT which is 25/30
> > minutes behind the backup start/end time registered in the backup_label.
> >
> > The parameter recovery_target_time is ignored and recovery proceeds
> > further applying all the available WAL Archive files finally ends up
> > bringing up the database.
> >
> > I think it would make sense if the recovery does not proceed any further
> > and error out with a message like "recovery_target_time is behind the
> > backup time.. please consider using the backup taken prior to the
> > recovery_target_time"
>
> I just tried it with 9.3.5 and I do get an error:
>
> LOG: starting point-in-time recovery to 2015-03-23 17:26:02.721307-04
> LOG: restored log file "000000010000000000000003" from archive
> LOG: redo starts at 0/30000C8
> LOG: recovery stopping before commit of transaction 1001, time
> 2015-03-23 18:26:01.012593-04
> LOG: redo done at 0/3000228
> FATAL: requested recovery stop point is before consistent recovery
> point
>
>
> That makes more sense. This is what i was expecting to happen. Then, i
> think it is something to do with the timestamp format.
>
>
> Here's my recovery.conf file:
>
> restore_command = '/usr/bin/pg_backrest.pl <http://pg_backrest.pl>
> --stanza=db archive-get %f "%p"'
> recovery_target_time = '2015-03-23 17:26:02.721307 EDT'
>
> > recovery.conf file is as follows :
> >
> > restore_command='cp /data/pgdata9400backup/pgwalarch9400backup/%f %p '
> > recovery_target_time='2015-03-23 10:00:26 GMT-10'
> > recovery_target_inclusive='true'
>
> You have '2015-03-23 10:00:26 GMT-10' in recovery.conf but the log says
> 'starting point-in-time recovery to 2015-03-23 10:00:26+10'. Note the -
> vs +.
>
>
> This is my confusion too. I picked up the time format from the backup label.
>
> Could you check your log and recovery.conf and make sure the timezone
> offsets are actually different?
>
>
> I am not sure why the timestamp is taken as "2015-03-23 10:00:26+10"
> for "2015-03-23 10:00:26 GMT-10'".
>
> My another system's timestamp format is also AEDT. I did another test
> and I get the same problem.
>
> Below is the text from the log file. I gave a recovery_target_time
> almost a day behind the consistent recovery point. Still, the
> recovery_target_time is taken as "*2015-03-23 11:07:10+11*" for
> "*2015-03-23 11:07:10 AEDT*"
>
> 2015-03-24 18:42:44.608 AEDT LOG: database system was interrupted;
> last known up at *2015-03-24 18:20:53 AEDT*
>
> 2015-03-24 18:42:44.608 AEDT LOG: starting point-in-time recovery
> to *2015-03-23 11:07:10+11*
>
> cp: /disk3/pgwalarch9401/0000000100000003000000FE: No such file or
> directory
>
> 2015-03-24 18:42:44.626 AEDT LOG: record with zero length at 3/FE000090
>
> Below is my recovery.conf file
>
> restore_command='cp /disk3/pgwalarch9401/%f %p'
>
> recovery_target_time='*2015-03-23 11:07:10 AEDT*'
>
> recovery_target_inclusive=true
>
> I am checking if this has something to do with my system timestamp format.
>
> Not sure what am i missing. Do i need to give any special time format ?

This is a weird one. I can reproduce the problem by setting
timezone=timezone='Australia/Sydney'. I also tried setting
log_timezone=timezone='Australia/Sydney' and my system clock to
'Australia/Sydney' but I still saw the same issue.

I'm testing this using unit tests for some backup software I'm working on.

1) Postgres is inited and started:

initdb -D /test/db/common -A trust;pg_ctl start -o "-c port=6543 -c
checkpoint_segments=1 -c log_timezone='Australia/Sydney' -c
timezone='Australia/Sydney' -c wal_level=hot_standby -c archive_mode=on
-c archive_command='/test/bin/pg_backrest.pl --stanza=db archive-push
%p' -c unix_socket_directories='/test/db'" -D /test/db/common -l
/test/db/common/postgresql.log -w -s

2) The backup/restore unit tests run

2015-03-24 09:56:19 T00 INFO: run 001 - rmt 0, arc_async 0, cmp 0
2015-03-24 09:56:23 T00 INFO: full backup (insert during backup)
2015-03-24 09:56:26 T00 INFO: time target is 2015-03-25
00:56:26.783117 EST
2015-03-24 09:56:26 T00 INFO: incr backup (update during backup)
2015-03-24 09:56:30 T00 INFO: xid target is 1007
2015-03-24 09:56:30 T00 INFO: name target is backrest
2015-03-24 09:56:30 T00 INFO: testing recovery type = default
2015-03-24 09:56:30 T00 INFO: restore, backup
'20150324-095626F', type 'default', expect exit 113 (postmaster running)
2015-03-24 09:56:33 T00 INFO: restore, backup
'20150324-095626F', type 'default', expect exit 115 (path not empty)
2015-03-24 09:56:33 T00 INFO: restore, backup
'20150324-095626F', type 'default'
2015-03-24 09:56:40 T00 INFO: testing recovery type = xid
2015-03-24 09:56:41 T00 INFO: restore, force, backup
'20150324-095626F', type 'xid', target '1007', resume
2015-03-24 09:56:47 T00 INFO: testing recovery type = preserve
2015-03-24 09:56:48 T00 INFO: restore delta, backup
'20150324-095626F', type 'preserve'
2015-03-24 09:56:54 T00 INFO: testing recovery type = time
2015-03-24 09:56:55 T00 INFO: restore delta, backup
'20150324-095626F', type 'time', target '2015-03-25 00:56:26.783117 EST'
expected value 'time' from 'select message from test' but actual was
'name' at lib/BackRestTest/BackupTest.pm line 178.

3) The "recovery type = time" test is failing because it's running to
the end of the WAL stream rather than stopping at the appropriate time.

Here's the recovery.conf:

restore_command = '/test/bin/pg_backrest.pl --stanza=db archive-get %f "%p"'
recovery_target_time = '2015-03-25 00:56:26.783117 EST'

And the log file:

LOG: database system was interrupted; last known up at 2015-03-25
00:56:24 EST
LOG: creating missing WAL directory "pg_xlog/archive_status"
LOG: starting point-in-time recovery to 2015-03-25 16:56:26.783117+11
LOG: restored log file "000000010000000000000003" from archive
LOG: redo starts at 0/30000C8
LOG: consistent recovery state reached at 0/3000280
LOG: restored log file "000000010000000000000004" from archive
LOG: restored log file "000000010000000000000005" from archive
FATAL: the database system is starting up
LOG: restored log file "000000010000000000000006" from archive
LOG: restored log file "000000010000000000000007" from archive
LOG: restored log file "000000010000000000000008" from archive
LOG: restored log file "000000010000000000000009" from archive
LOG: restored log file "00000001000000000000000A" from archive
FATAL: the database system is starting up
LOG: restored log file "00000001000000000000000B" from archive
LOG: restored log file "00000001000000000000000C" from archive
LOG: redo done at 0/C000130
LOG: last completed transaction was at log time 2015-03-25
00:56:30.658752+11
LOG: restored log file "00000001000000000000000C" from archive
FATAL: the database system is starting up
LOG: restored log file "00000002.history" from archive
LOG: restored log file "00000003.history" from archive
LOG: restored log file "00000004.history" from archive
LOG: selected new timeline ID: 5
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

There's definitely some kind of multiple TZ conversion going on here,
but I'm not see why. However, I did get the test to work again by
removing the TZ from the timestamp in recovery.conf:

restore_command = '/test/bin/pg_backrest.pl --stanza=db archive-get %f "%p"'
recovery_target_time = '2015-03-25 01:08:45.765919'

Here's the log:

LOG: database system was interrupted; last known up at 2015-03-25
01:08:43 EST
LOG: creating missing WAL directory "pg_xlog/archive_status"
LOG: starting point-in-time recovery to 2015-03-25 01:08:45.765919+11
LOG: restored log file "000000010000000000000003" from archive
LOG: redo starts at 0/30000C8
LOG: consistent recovery state reached at 0/3000280
LOG: restored log file "000000010000000000000004" from archive
LOG: restored log file "000000010000000000000005" from archive
LOG: recovery stopping before commit of transaction 1003, time
2015-03-25 01:08:45.770163+11
LOG: redo done at 0/5013900
LOG: last completed transaction was at log time 2015-03-25
01:08:45.765465+11
LOG: restored log file "00000002.history" from archive
FATAL: the database system is starting up
LOG: restored log file "00000003.history" from archive
LOG: restored log file "00000004.history" from archive
LOG: selected new timeline ID: 5
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

You can see that the timestamps now match up and the test was
successful. I know this doesn't completely solve the problem, but
hopefully it will get you going again and might provide a clue to the
underlying behavior.

There may also be some confusion because EST is used in both Australia
and the U.S (though it is EDT here now). I wasn't able to set the
timezone on Postgres to AEDT or get any output that said AEDT. I'd be
curious to know how you did that.

These tests were run on my OSX laptop which might introduce even more
variables. In production I recommend running the OS and DB on UTC to
avoid such confusion (though it does introduce a different set of problems).

--
- David Steele
david(at)pgmasters(dot)net

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-03-24 15:33:13 Re: Replication identifiers, take 4
Previous Message Sawada Masahiko 2015-03-24 14:28:01 Re: Auditing extension for PostgreSQL (Take 2)