Re: warm standby server stops doing checkpoints after awhile

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Frank Wittig" <fw(at)weisshuhn(dot)de>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: warm standby server stops doing checkpoints after awhile
Date: 2007-06-01 10:58:18
Message-ID: 1180695498.26297.97.camel@silverbirch.site
Lists: pgsql-general

On Thu, 2007-05-31 at 10:23 -0400, Tom Lane wrote:
> Frank Wittig <fw(at)weisshuhn(dot)de> writes:
> > The problem is that the slave server stops checkpointing after some
> > hours of working (about 24 to 48 hours of continued log replay).
>
> Hm ... look at RecoveryRestartPoint() in xlog.c. Could there be
> something wrong with this logic?
>
>     /*
>      * Do nothing if the elapsed time since the last restartpoint is less than
>      * half of checkpoint_timeout. (We use a value less than
>      * checkpoint_timeout so that variations in the timing of checkpoints on
>      * the master, or speed of transmission of WAL segments to a slave, won't
>      * make the slave skip a restartpoint once it's synced with the master.)
>      * Checking true elapsed time keeps us from doing restartpoints too often
>      * while rapidly scanning large amounts of WAL.
>      */
>     elapsed_secs = time(NULL) - ControlFile->time;
>     if (elapsed_secs < CheckPointTimeout / 2)
>         return;
>
> The idea is that the slave (once in sync with the master) ought to
> checkpoint every time it sees a checkpoint record in the master's
> output. I'm not seeing a flaw but maybe there is one here, or somewhere
> nearby. Are you sure the master is checkpointing?

Hmmm. This can happen if a backend crashes while halfway through any
set of changes that causes safe_restartpoint() to return false. Or it might
be that one of the index AMs doesn't correctly clear its multi-WAL actions
in some corner cases.
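
To make the failure mode concrete, here is a rough sketch of how I
understand the decision. It is a simplified model, not the actual xlog.c
code: restartpoint_allowed(), rm_clean() and rm_stuck() are hypothetical
names invented for illustration. The point is that a single resource
manager that keeps reporting an unfinished multi-WAL action vetoes every
restartpoint, no matter how much time has passed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a resource manager's "safe restartpoint" callback. */
typedef bool (*safe_restartpoint_fn)(void);

/* Illustrative callbacks: one with nothing pending, one that never clears. */
static bool rm_clean(void) { return true; }
static bool rm_stuck(void) { return false; }

/*
 * Simplified model of the restartpoint decision (an assumption, not the
 * real code): first the elapsed-time gate quoted above, then a veto from
 * any resource manager that still has an incomplete multi-WAL action.
 */
static bool
restartpoint_allowed(long elapsed_secs, int checkpoint_timeout_secs,
                     const safe_restartpoint_fn *callbacks, size_t ncallbacks)
{
    /* Skip if less than half of checkpoint_timeout has elapsed. */
    if (elapsed_secs < checkpoint_timeout_secs / 2)
        return false;

    /* Resource managers without a callback (NULL) are always considered safe. */
    for (size_t i = 0; i < ncallbacks; i++)
    {
        if (callbacks[i] != NULL && !callbacks[i]())
            return false;
    }
    return true;
}
```

With a callback like rm_stuck in the list, restartpoint_allowed() returns
false for any elapsed time, which would match a standby that silently stops
doing restartpoints.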

Or it could be that the mdsync looping problem has been worse than we
thought and checkpoints have been avoided completely for some time.

Frank,

This is repeatable, yes?
Has anything crashed on your server?
Are you using GIN or GIST indexes?

I'll look at putting some debug information in there that logs whether
multi-WAL actions remain unresolved for any length of time.
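
Something along these lines is what I have in mind for the debug output.
This is only a sketch under my own assumptions: PendingTracker and
tracker_observe() are hypothetical names, the message text is made up, and
real code would report through the server log rather than stderr.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical bookkeeping: when did restartpoints first become unsafe? */
typedef struct
{
    bool   pending;   /* true while some multi-WAL action is unresolved */
    time_t since;     /* time at which the unresolved state was first seen */
} PendingTracker;

/*
 * Call once per restartpoint attempt. Returns the number of seconds the
 * unresolved state has lasted (0 if everything is resolved), and prints a
 * warning once that exceeds warn_after_secs.
 */
static long
tracker_observe(PendingTracker *t, bool safe_now, time_t now,
                long warn_after_secs)
{
    if (safe_now)
    {
        /* Everything resolved: reset the tracker. */
        t->pending = false;
        return 0;
    }

    if (!t->pending)
    {
        /* First time we see the unresolved state: remember when. */
        t->pending = true;
        t->since = now;
    }

    long waited = (long) (now - t->since);

    if (waited >= warn_after_secs)
        fprintf(stderr,
                "restartpoint skipped: multi-WAL action unresolved for %ld s\n",
                waited);
    return waited;
}
```

If the reported duration keeps growing across many checkpoint records from
the master, that would point at an action that is never being cleared.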

Continuing to think about this one....

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
