Re: Recovery inconsistencies, standby much larger than primary

From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recovery inconsistencies, standby much larger than primary
Date: 2014-02-06 21:22:01
Message-ID: CAM-w4HPvJCBRVV3dXg8aj0WzkU08dHuX-XYbfDYQhNrn5bnTQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 3, 2014 at 12:02 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> What version were you running before 9.1.11 exactly? I took a look
> through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
> changes that seemed even vaguely related to this. There are some
> changes in known-transaction tracking, but it's hard to see a connection
> there. Most of the other diffs are in code that wouldn't execute during
> WAL replay at all.

Both the primary and the standby were 9.1.11 from the get-go. The
database the primary was forked off of was 9.1.10 but as far as I can
tell the primary in the current pair has no problems.

What's worse is we created a new standby from the same base backup and
replayed the same records and it didn't reproduce the problem. This
means either it's a hardware problem -- but we've seen it on multiple
standbys on this database and at least one other database which is in
a different data centre -- or it's a race condition --but that's hard
to credit in the recovery code which is basically single-threaded.

And these records are from before the standby reaches a consistency so
it's hard to see how a connection from a hot standby client could
cause any kind of race condition. The only other thread that could
conceivably cause a heisenbug is the bgwriter. It's hard to imagine
how a race condition in there could be so easy to hit that it would
happen four times on one restore but otherwise go mostly unnoticed.
--
greg

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-02-06 21:35:07 Re: mvcc catalo gsnapshots and TopTransactionContext
Previous Message Stefan Kaltenbrunner 2014-02-06 19:49:35 Re: [DOCS] Re: Viability of text HISTORY/INSTALL/regression README files (was Re: [COMMITTERS] pgsql: Document a few more regression test hazards.)