Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date: 2017-04-23 02:11:40
Message-ID: CAEepm=1J1PxBjUNthkjc__mZLgO4T-huK6tSoSAdCz+vuy2Y5A@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> The assertion fails reliably for me, because standby2's reported write
>> LSN jumps backwards after the timeline changes: for example I see
>> 3020000 then 3028470 then 3020000 followed by a normal progression.
>> Surprisingly, 004_timeline_switch.pl reports success anyway. I'm not
>> sure why the test fails sometimes on tern, but you can see that even
>> when it passed on tern the assertion had failed.
>
> Whoa. This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway? That's impossibly
> broken/unacceptable.

Agreed, thanks for fixing that.

> Looking closer at the tern report we started the thread with, there
> are actually TWO assertion trap reports, the one Alvaro noted and
> another one in 009_twophase_master.log:
>
> TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

I see you started another thread for that one. I admit I spent a
couple of hours trying to figure this out before I saw your email, but
I was looking at the wrong bit of git history and didn't spot that
it's likely a 7-year-old problem. So this is a good result for these
TAP tests, despite teething difficulties with, erm, "pass" vs "fail"
and the fact that 009_twophase.pl was bombing from the moment it was
committed. I'm hoping to use this framework in future work.

>> Here is a fix for the assertion failure.
>
> As for this patch itself, is it reasonable to try to assert that the
> timeline has in fact changed?

The protocol doesn't include the timeline in reply messages, so it's
not clear how the upstream server would know what timeline the standby
thinks it's dealing with in any given reply message. The sending
server has its own idea of the current timeline but it's not in sync
with the stream of incoming replies.
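
To make the point about reply messages concrete, here is a minimal
sketch of decoding a standby status update ('r') message, assuming the
field layout given in the streaming replication protocol documentation:
a type byte, three 64-bit LSNs (write, flush, apply), a 64-bit client
timestamp and a reply-requested flag. The struct and function names are
hypothetical and are not taken from the PostgreSQL source tree. Note
that there is no timeline field for the sender to check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of a standby status update ('r') message. */
typedef struct StandbyReply
{
    uint64_t write_lsn;       /* last WAL position written on the standby */
    uint64_t flush_lsn;       /* last WAL position flushed to disk */
    uint64_t apply_lsn;       /* last WAL position replayed */
    int64_t  send_time;       /* standby clock, microseconds since 2000-01-01 */
    bool     reply_requested; /* standby asks the sender to reply at once */
    /* No timeline ID: the message carries LSNs only. */
} StandbyReply;

/* 1 type byte + 4 * 8-byte integers + 1 flag byte */
#define STANDBY_REPLY_LEN 34

/* Read a 64-bit integer in network (big-endian) byte order. */
static uint64_t
read_u64_be(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode buf into *out; returns false if the length or type byte is wrong. */
bool
parse_standby_reply(const unsigned char *buf, size_t len, StandbyReply *out)
{
    if (len != STANDBY_REPLY_LEN || buf[0] != 'r')
        return false;

    out->write_lsn = read_u64_be(buf + 1);
    out->flush_lsn = read_u64_be(buf + 9);
    out->apply_lsn = read_u64_be(buf + 17);
    out->send_time = (int64_t) read_u64_be(buf + 25);
    out->reply_requested = buf[33] != 0;
    return true;
}
```

Because a decoded reply contains positions but no timeline, a write LSN
that moves backwards (such as 3028470 followed by 3020000) cannot by
itself be attributed to a timeline switch on the receiving side.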

--
Thomas Munro
http://www.enterprisedb.com
