Quick Links

Re: Hot Backup with rsync fails at pg_clog if under load

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Florian Pflug <fgp(at)phlo(dot)org>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Daniel Farina <daniel(at)heroku(dot)com>, Chris Redekop <chris(at)replicon(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hot Backup with rsync fails at pg_clog if under load
Date:	2011-10-27 13:51:00
Message-ID:	CA+U5nM+7nXEceaFQ9rpKEP45TV0ZzdEfuqdnkaWfBcB4bu_=cA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Oct 27, 2011 at 12:29 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:

> Per my theory about the cause of the problem in my other mail, I think you
> might see StartupCLOG failures even during crash recovery, provided that
> wal_level was set to hot_standby when the primary crashed. Here's how
>
> 1) We start a checkpoint, and get as far as LogStandbySnapshot()
> 2) A backend does AssignTransactionId, and gets as far as GetTransactionoId().
> The assigned XID requires CLOG extension.
> 3) The checkpoint continues, and LogStandbySnapshot () advances the
> checkpoint's nextXid to the XID assigned in (2).
> 4) We crash after writing the checkpoint record, but before the CLOG
> extension makes it to the disk, and before any trace of the XID assigned
> in (2) makes it to the xlog.
>
> Then StartupCLOG() would fail at the end of recovery, because we'd end up
> with a nextXid whose corresponding CLOG page doesn't exist.

Clog extension holds XidGenLock, as does LogStandbySnapshot, which
specifically excludes the above scenario.

> Quite aside from that concern, I think it's probably not a good idea
> for the nextXid value of a checkpoint to depend on whether wal_level
> was set to hot_standby or not. Our recovery code is already quite complex
> and hard to test, and this just adds one more combination that has to
> be thought about while coding and that needs to be tested.
>
> My suggestion is to fix the CLOG problem in that same way that you fixed
> the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before
> CheckPointGuts().
>
> Here's what I image CreateCheckPoint() should look like:
>
> 1) LogStandbySnapshot() and fill out oldestActiveXid
> 2) Fill out REDO
> 3) Wait for concurrent commits
> 4) Fill out nextXid and the other fields
> 5) CheckPointGuts()
> 6) Rest
>
> It's then no longer necessary for LogStandbySnapshot() do modify
> the nextXid, since we fill out nextXid after LogStandbySnapshot() and
> will thus derive a higher value than LogStandbySnapshot() would have.
>
> We could then also fold GetOldestActiveTransactionId() back into
> your proposed LogStandbySnapshot() and thus don't need two ProcArray
> traversals.

I think you make a good case for doing this.

However, I'm concerned that moving LogStandbySnapshot() in a backpatch
seems more risky than it's worth. We could easily introduce a new bug
into what we would all agree is a complex piece of code. Minimal
change seems best in this case.

And also, 2 ProcArray traversals is not a problem there.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Re: Hot Backup with rsync fails at pg_clog if under load at 2011-10-26 23:29:49 from Florian Pflug

Responses

Re: Hot Backup with rsync fails at pg_clog if under load at 2011-10-27 14:03:13 from Florian Pflug

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Florian Pflug	2011-10-27 14:03:13	Re: Hot Backup with rsync fails at pg_clog if under load
Previous Message	Simon Riggs	2011-10-27 13:36:54	Re: Hot Backup with rsync fails at pg_clog if under load