Re: Hot Backup with rsync fails at pg_clog if under load

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Chris Redekop <chris(at)replicon(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Backup with rsync fails at pg_clog if under load
Date: 2011-10-26 15:59:59
Message-ID: 01A2E400-AC30-48D5-8DD7-9F789D736CE7@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Oct26, 2011, at 17:36 , Chris Redekop wrote:
> > And I think they also reported that if they didn't run hot standby,
> > but just normal recovery into a new master, it didn't have the problem
> > either, i.e. without hotstandby, recovery ran, properly extended the
> > clog, and then ran as a new master fine.
>
> Yes this is correct...attempting to start as hotstandby will produce the
> pg_clog error repeatedly and then without changing anything else, just
> turning hot standby off it will start up successfully.

Yup, because with hot standby disabled (on the client side), StartupCLOG()
happens after recovery has completed. That, at the very least, makes the
problem very unlikely to occur in the non-hot-standby case. I'm not sure
it's completely impossible, though.

Per my theory about the cause of the problem in my other mail, I think you
might see StartupCLOG failures even during crash recovery, provided that
wal_level was set to hot_standby when the primary crashed. Here's how

1) We start a checkpoint, and get as far as LogStandbySnapshot()
2) A backend does AssignTransactionId, and gets as far as GetTransactionoId().
The assigned XID requires CLOG extension.
3) The checkpoint continues, and LogStandbySnapshot () advances the
checkpoint's nextXid to the XID assigned in (2).
4) We crash after writing the checkpoint record, but before the CLOG
extension makes it to the disk, and before any trace of the XID assigned
in (2) makes it to the xlog.

Then StartupCLOG() would fail at the end of recovery, because we'd end up
with a nextXid whose corresponding CLOG page doesn't exist.

> > This fits the OP's observation ob the
> > problem vanishing when pg_start_backup() does an immediate checkpoint.
>
> Note that this is *not* the behaviour I'm seeing....it's possible it happens
> more frequently without the immediate checkpoint, but I am seeing it happen
> even with the immediate checkpoint.

Yeah, I should have said "of the problem's likelihood decreasing" instead
of "vanishing". The point is, the longer the checkpoint takes, the higher
the chance the nextId is advanced far enough to require a CLOG extension.

That alone isn't enough to trigger the error - the CLOG extension must also
*not* make it to the disk before the checkpoint completes - but it's
a required precondition for the error to occur.

best regards,
Florian Pflug

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2011-10-26 16:08:45 Re: Hot Backup with rsync fails at pg_clog if under load
Previous Message Heikki Linnakangas 2011-10-26 15:52:28 Re: Range Types - typo + NULL string constructor