PITR COPY Failure (was Point in Time Recovery)

From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: PITR COPY Failure (was Point in Time Recovery)
Date: 2004-07-18 05:50:11
Message-ID: 40FA0F93.8010607@coretech.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-hackers pgsql-patches

I decided to produce a nice simple example, so that anyone could
hopefully replicate what I am seeing.

The scenario is the same as before (the 11 steps), but the CREATE TABLE
and COPY step has been reduced to:

CREATE TABLE test0 (filler VARCHAR(120));
COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ',';

Now the file 'test0.dat' consists of (128293) identical lines, each of
109 'a' charactors (plus end of line)

A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg, PGDATA,
and you will need to make your own data file)

The main points of interest are:
- anything <=128392 rows in test0.dat results in 1 archived log, and the
recovery succeeds
- anything >=128393 rows in test0.dat results in 2 or more archived
logs, and recovery fails on the second log (and gives the zero length
redo at 0/1FFFFE0 message).

Let me know if I can do any more legwork on this (I am considering
re-compiling with WAL_DEBUG now that example is simpler)

regards

Mark

Simon Riggs wrote:

>On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
>
>
>>I tried what I thought was a straightforward scenario, and seem to have
>>broken it :-(
>>
>>Here is the little tale
>>
>>1) initdb
>>2) set archive_mode and archive_dest in postgresql.conf
>>3) startup
>>4) create database called 'test'
>>5) connect to 'test' and type 'checkpoint'
>>6) backup PGDATA using 'tar -zcvf'
>>7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
>>8) shutdown and remove PGDATA
>>9) recover using 'tar -zxvf'
>>10) copy recovery.conf into PGDATA
>>11) startup
>>
>>This is what I get :
>>
>>LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
>>LOG: recovery command file found...
>>LOG: restore_program = cp %s/%s %s
>>LOG: recovery_target_inclusive = true
>>LOG: recovery_debug_log = true
>>LOG: starting archive recovery
>>LOG: restored log file "0000000000000000" from archive
>>LOG: checkpoint record is at 0/A48054
>>LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
>>LOG: next transaction ID: 496; next OID: 25419
>>LOG: database system was not properly shut down; automatic recovery in
>>progress
>>LOG: redo starts at 0/A48094
>>LOG: restored log file "0000000000000001" from archive
>>LOG: record with zero length at 0/1FFFFE0
>>LOG: redo done at 0/1FFFF30
>>LOG: restored log file "0000000000000001" from archive
>>LOG: restored log file "0000000000000001" from archive
>>PANIC: concurrent transaction log activity while database system is
>>shutting down
>>LOG: startup process (PID 13492) was terminated by signal 6
>>LOG: aborting startup due to startup process failure
>>
>>The concurrent access is a bit of a puzzle, as this is my home machine
>>(i.e. I am *sure* noone else is connected!)
>>
>>
>>
>
>I can see what is wrong now, but you'll have to help me on details your
>end...
>
>The log shows that xlog 1 was restored from archive. It contains a zero
>length record, which indicates that it isn't yet full (or thats what the
>existing recovery code assumes it means). Which also indicates that it
>should never have been archived in the first place, and should not
>therefore be a candidate for a restore from archive.
>
>The double message "restored log file" can only occur after you've
>retrieved a partially full file from archive - which as I say, shouldn't
>be there.
>
>Other messages are essentially spurious in those circumstances.
>
>Either:
>- somehow the files have been mixed up in the archive directory, which
>is possible if the filing discipline is not strict - various ways,
>unfortunately I would guess this to be the most likely, somehow
>- the file that has been restored has been damaged in some way
>- the archiver has archived a file too early (very unlikely, IMHO -
>thats the most robust bit of the code)
>- some aspect of the code has written a zero length record to WAL (which
>is supposed to not be possible, but we musn't discount an error in
>recent committed work)
>
>- there may also be an effect going on with checkpoints that I don't
>understand...spurious checkpoint warning messages have already been
>observed and reported,
>
>Best regards, Simon Riggs
>
>
>
>
>
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Tom Lane 2004-07-18 06:15:34 Re: PITR COPY Failure (was Point in Time Recovery)
Previous Message Tom Lane 2004-07-18 05:04:43 Re: [HACKERS] Point in Time Recovery

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2004-07-18 05:51:23 Re: NT + deadlock intended behaviour ?
Previous Message Tom Lane 2004-07-18 05:38:57 Re: NT + deadlock intended behaviour ?

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2004-07-18 06:15:34 Re: PITR COPY Failure (was Point in Time Recovery)
Previous Message Alvaro Herrera 2004-07-18 05:48:30 Re: [subxacts] Savepoint syntax