Re: Hot Backup with rsync fails at pg_clog if under load

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Daniel Farina <daniel(at)heroku(dot)com>, Chris Redekop <chris(at)replicon(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Backup with rsync fails at pg_clog if under load
Date: 2011-10-27 12:03:23
Message-ID: AAE7BBFB-641C-482C-A8D2-0D61E2DF3B09@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Oct27, 2011, at 08:57 , Heikki Linnakangas wrote:
> On 27.10.2011 02:29, Florian Pflug wrote:
>> Per my theory about the cause of the problem in my other mail, I think you
>> might see StartupCLOG failures even during crash recovery, provided that
>> wal_level was set to hot_standby when the primary crashed. Here's how
>>
>> 1) We start a checkpoint, and get as far as LogStandbySnapshot()
>> 2) A backend does AssignTransactionId, and gets as far as GetTransactionoId().
>> The assigned XID requires CLOG extension.
>> 3) The checkpoint continues, and LogStandbySnapshot () advances the
>> checkpoint's nextXid to the XID assigned in (2).
>> 4) We crash after writing the checkpoint record, but before the CLOG
>> extension makes it to the disk, and before any trace of the XID assigned
>> in (2) makes it to the xlog.
>>
>> Then StartupCLOG() would fail at the end of recovery, because we'd end up
>> with a nextXid whose corresponding CLOG page doesn't exist.
>
> No, clog extension is WAL-logged while holding the XidGenLock. At step 3,
> LogStandbySnapshot() would block until the clog-extension record is written
> to WAL, so crash recovery would see and replay that record before calling
> StartupCLOG().

Hm, true. But it still seems wrong for LogStandbySnapshot() to modify the
checkpoint's nextXid, and even more wrong to do that only if wal_mode =
hot_standby. Plus, I think it's a smart idea to verify that the required
parts of the CLOG are available at the start of recovery. Because if they're
missing, the data on the standby *will* be corrupted. Is there any argument
against doiing LogStandbySnapshot() earlier (i.e., at the time oldestActiveXid
is computed)?

best regards,
Florian Pflug

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-10-27 12:09:19 Re: isolationtester and invalid permutations
Previous Message Robert Haas 2011-10-27 12:00:08 Re: Updated version of pg_receivexlog