Re: Assertion failure at standby promotion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure at standby promotion
Date: 2013-05-07 18:34:42
Message-ID: 51894942.4080500@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 06.05.2013 13:08, Heikki Linnakangas wrote:
> On 03.05.2013 18:17, Fujii Masao wrote:
>> Hi,
>>
>> I got the following assertion failure when I promoted the standby.
>>
>> 2013-05-04 00:12:31 JST sby1 LOG: received promote request
>> 2013-05-04 00:12:31 JST sby1 FATAL: terminating walreceiver process
>> due to administrator command
>> 2013-05-04 00:12:31 JST sby1 LOG: redo done at 0/6FFE038
>> 2013-05-04 00:12:31 JST sby1 LOG: last completed transaction was at
>> log time 2013-05-04 00:12:25.878909+09
>> 2013-05-04 00:12:31 JST sby1 LOG: selected new timeline ID: 2
>> 2013-05-04 00:12:31 JST sby1 LOG: archive recovery complete
>> 2013-05-04 00:12:31 JST sby1 LOG: checkpoint starting:
>> TRAP: FailedAssertion("!(sentPtr<= sendTimeLineValidUpto)", File:
>> "walsender.c", Line: 1465)
>> 2013-05-04 00:12:31 JST sby1 LOG: autovacuum launcher started
>>
>> The way to reproduce this is:
>>
>> 1. Create one master A, one standby B, and one cascade standby C.
>> 2. Run pgbench -i -s 10
>> 3. Promote the standby B before pgbench -i finishes
>
> I was able to reproduce this. The assertion checks that if the system is
> promoted at WAL location X, we must not have already sent WAL at > X to
> the client. As the code stands, that assumption is wrong; the walsender
> will merrily stream WAL that hasn't been replayed yet, and the system
> can be promoted before replaying all the WAL that has been streamed to a
> cascading standby. The comment in GetStandbyFlushRecPtr(), which is the
> function that determined how far the WAL may be streamed to a cascading
> standby, says this:
>
>> /*
>> * We can safely send what's already been replayed. Also, if walreceiver
>> * is streaming WAL from the same timeline, we can send anything that
>> * it has streamed, but hasn't been replayed yet.
>> */
>
> There seems to be two bugs here:
>
> 1. This used to work in 9.2, because the startup process would always
> replay all the WAL present in pg_xlog before promoting (the WAL present
> in pg_xlog was streamed from master). But the refactorings in xlog.c in
> 9.3devel broke that, so that the startup process can promote earlier.
>
> 2. Even after fixing the logic in xlog.c, there is still a corner-case
> where the startup process can promote before all the WAL that has been
> received from walreceiver has been received. That happens if the WAL
> streaming is terminated at a page boundary, rather than at a record
> boundary. For example, we might have received WAL up to the page
> boundary at 0/5BFA000, but the last *complete* record that we have
> received ends at 0/5BF9BD8.
>
> To fix the second issue, I think two things need to happen. First, we
> need to suppress the check in walsender. Second, we need to teach the
> WAL replay to back off when that happens. At the moment, the replay in
> the cascading standby gets stuck, trying to fetch the next page
> containing rest of the partial WAL record. Instead, it should throw away
> the partial record it has, and resync at the end of the last replayed
> record.

I came up with the attached patch for this
(fix-standby-promotion-assert-fail-2.patch). You will also need to apply
fast-promotion-quick-fix.patch to work around the bug in fast promotion
I reported here:
http://www.postgresql.org/message-id/5188CFFA.3020209@vmware.com.

pg_receivexlog has a variant of the same bug. If the server has sent WAL
up to segment boundary, and the segment boundary splits a WAL record in
half, and the server is then promoted so that the promotion checkpoint
(or end-of-recovery) record goes to the previous segment, pg_receivexlog
will not fetch the segment containing the checkpoint record correctly.
When following a timeline switch, it should restart streaming from the
new timeline at the point where the timeline switch occurred, rather
than at the point on the old timeline where it stopped. Usually it's the
same thing, but not if the streaming was paused at a page or segment
boundary.

To fix this for pg_receivexlog, I added the start position of the next
timeline to the result set that the server sends at end of WAL
streaming. Previously, it only sent the TLI of the next timeline, and it
was assumed that the starting point is the same as the point where
streaming was stopped.

It's a bit late to change it, but I have to say I don't like this whole
business of relaying WAL to a cascading standby that hasn't been
replayed. It would be a lot simpler if you could assume that whatever
gets streamed downstream has already been replayed upstream. I
understand that it's nice from a performance point of view, and because
a cascading standby isn't then lagged behind if the replay gets stalled
in the upstream server because of a hot standby conflict. But still..

I'm going to sleep over this and continue testing tomorrow. Please have
a look.

> I think 9.2 has the same bug, BTW. Without support for timeline
> switches over streaming replication, it was just more difficult to hit.

Looking closer, I believe 9.2 is OK. Recovery loops back to retry the
whole record correctly, it was just the timeline switch over streaming
replication code that got confused.

- Heikki

Attachment Content-Type Size
fix-standby-promotion-assert-fail-2.patch text/x-diff 15.2 KB
fast-promotion-quick-fix.patch text/x-diff 433 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ian Lawrence Barwick 2013-05-07 18:36:41 Re: \watch stuck on execution of commands returning no tuples
Previous Message Robert Haas 2013-05-07 18:14:51 Re: \watch stuck on execution of commands returning no tuples