Re: Avoiding adjacent checkpoint records

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Avoiding adjacent checkpoint records
Date: 2012-06-08 17:07:53
Message-ID: CA+TgmoYZp3ngTHOv1T15WvmahGjQ5j2Q_=0CKUf03R7SJmjn7g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 8, 2012 at 12:24 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> I haven't been exactly clear on the risks about which Tom and Robert
> have been concerned; is it a question about whether we change the
> meaning of these settings to something more complicated?:
>
> checkpoint_segments (integer)
>    Maximum number of log file segments between automatic WAL
>    checkpoints
>
> checkpoint_timeout (integer)
>    Maximum time between automatic WAL checkpoints

The issue is that, in the tip of the 9.2 branch, checkpoint_timeout is
no longer the maximum time between automatic WAL checkpoints.
Instead, the checkpoint is skipped if we're still in the same WAL
segment that we were in when we did the last checkpoint. Therefore,
there is absolutely no upper bound on the amount of time that can pass
between checkpoints. If someone does one transaction, which happens
not to cross a WAL segment boundary, we will never automatically
checkpoint that transaction. A checkpoint will ONLY be triggered when
we have enough write-ahead log volume to get us into the next segment.
I am arguing (and Tom is now agreeing) that this is bad, and that the
patch which made this change needs either some kind of fix, or to be
reverted completely.
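
To make the failure mode concrete, here is a toy model of the rule as
described above. It is not the PostgreSQL source: the function names are
made up for illustration, WAL positions are plain byte offsets, and the
16 MB segment size is just the default. It also ignores the WAL written
by the checkpoint itself.

```python
# Toy model only; none of these names exist in PostgreSQL.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size, in bytes


def segment_of(wal_position):
    """Return the index of the WAL segment containing a byte offset."""
    return wal_position // WAL_SEGMENT_SIZE


def skip_checkpoint_92(wal_position, last_checkpoint_position):
    """Rule described above: skip while still in the same WAL segment."""
    return segment_of(wal_position) == segment_of(last_checkpoint_position)


def count_timed_checkpoints(wal_positions_at_timeout, last_checkpoint_position=0):
    """Count checkpoints actually performed across a series of timeouts.

    wal_positions_at_timeout holds the current WAL byte offset each time
    checkpoint_timeout expires.
    """
    performed = 0
    for position in wal_positions_at_timeout:
        if not skip_checkpoint_92(position, last_checkpoint_position):
            performed += 1
            last_checkpoint_position = position
    return performed


# One small transaction writes 500 bytes, then the system sits idle
# through 100 timeouts: no checkpoint is ever performed.
idle_after_one_xact = count_timed_checkpoints([500] * 100)

# Once enough WAL accumulates to reach the next segment, one happens.
after_segment_switch = count_timed_checkpoints([500, 17_000_000])
```

In this model the number of elapsed timeouts never enters the skip
decision at all, which is the sense in which there is no upper bound.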

The original motivation for the patch was that the code to suppress
duplicate checkpoints stopped working correctly when Hot Standby was
committed. The previous coding (before the commit at issue) skips a
checkpoint if no write-ahead log records at all have been emitted
since the start of the preceding checkpoint. I believe this is the
correct behavior, but there's a problem: when wal_level =
hot_standby, we emit an XLOG_RUNNING_XACTS record during every
checkpoint cycle. So, if wal_level = hot_standby, the test for
whether nothing has happened always returns false, and so the system
never quiesces: every checkpoint cycle contains at least the
XLOG_RUNNING_XACTS record, even if nothing else, so we never get to
skip any checkpoints. When wal_level < hot_standby, the problem does
not exist and redundant checkpoints are suppressed just as we would
hope.
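
A rough sketch of why the older test stops working, again as a toy
model rather than the real code: the function name and the record
counter are invented for illustration, and the only WAL activity
modelled is the XLOG_RUNNING_XACTS record mentioned above.

```python
# Toy model only; the counter and function name are hypothetical.
def run_idle_cycles(cycles, wal_level):
    """Count checkpoints performed on an otherwise idle system.

    Models the older rule: skip the checkpoint if no WAL records have
    been emitted since the start of the preceding checkpoint.
    """
    records_since_checkpoint_start = 1  # some activity before going idle
    performed = 0
    for _ in range(cycles):
        if records_since_checkpoint_start == 0:
            continue  # nothing happened, so the checkpoint is skipped
        performed += 1
        records_since_checkpoint_start = 0
        if wal_level == "hot_standby":
            # Every checkpoint cycle emits an XLOG_RUNNING_XACTS record,
            # so the counter is never zero at the next timeout.
            records_since_checkpoint_start += 1
    return performed


quiesces = run_idle_cycles(10, "minimal")          # one checkpoint, then skips
never_quiesces = run_idle_cycles(10, "hot_standby")  # a checkpoint every cycle
```

With wal_level = minimal the model performs one checkpoint and then
skips the rest; with hot_standby it performs one per cycle.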

While Simon's patch does fix the problem, I believe that making
checkpoint_timeout anything less than a hard timeout is unwise. The
previous behavior - at least one checkpoint per checkpoint_timeout -
is easy to understand and plan for; I believe the new behavior will be
an unpleasant surprise for users who care about checkpointing
regularly, which I think most do, whether they are here to be
represented in this conversation or not. So I think we need a
different fix for the problem that wal_level = hot_standby defeats the
redundant-checkpoint-detection code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
