Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date: 2017-04-22 15:41:01
Message-ID: 4219.1492875661@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> The assertion fails reliably for me, because standby2's reported write
> LSN jumps backwards after the timeline changes: for example I see
> 3020000 then 3028470 then 3020000 followed by a normal progression.
> Surprisingly, 004_timeline_switch.pl reports success anyway. I'm not
> sure why the test fails sometimes on tern, but you can see that even
> when it passed on tern the assertion had failed.
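To make the quoted regression concrete, here is a minimal sketch (not code from the patch or from the walsender) of a check that flags any point where a sequence of reported write LSNs moves backwards. The function name `find_backward_jumps` is hypothetical, and reading the three quoted values as hexadecimal byte positions is an assumption.

```python
def find_backward_jumps(lsns):
    """Return (index, previous, current) for each position where the LSN decreases."""
    jumps = []
    for i in range(1, len(lsns)):
        if lsns[i] < lsns[i - 1]:
            jumps.append((i, lsns[i - 1], lsns[i]))
    return jumps


# The three values quoted above, interpreted as hexadecimal (an assumption).
reported = [int(s, 16) for s in ("3020000", "3028470", "3020000")]

for index, previous, current in find_backward_jumps(reported):
    print(f"report {index}: write LSN went from {previous:X} back to {current:X}")
```

An assertion that the reported position never decreases would trip on the third report in this sequence, which matches the behaviour described in the quoted text.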

Whoa. This just turned into a much larger can of worms than I expected.
How can it be that processes are getting assertion crashes and yet the
test framework reports success anyway? That's impossibly
broken/unacceptable.

Looking closer at the tern report we started the thread with, there
are actually TWO assertion trap reports, the one Alvaro noted and
another one in 009_twophase_master.log:

TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

When I run the recovery test on my own machine, it reports success
(quite reliably, I tried a bunch of times yesterday), but now that
I know to look:

$ grep TRAP tmp_check/log/*
tmp_check/log/009_twophase_master.log:TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
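
The manual grep above could be automated so that a run cannot be reported as passing while a server log contains an assertion trap. The following is only a sketch under stated assumptions, not part of the TAP framework: the names `find_traps` and `main` are hypothetical, the log files are assumed to end in `.log`, and the default directory is the `tmp_check/log` path shown in the grep output.

```python
from pathlib import Path


def find_traps(log_dir):
    """Return (file name, line) pairs for every log line containing 'TRAP:'."""
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        with open(path, errors="replace") as fh:
            for line in fh:
                # Same criterion as the grep above: any line mentioning TRAP.
                if "TRAP:" in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits


def main(argv):
    """Print every trap found and return a nonzero status if there is one."""
    log_dir = argv[0] if argv else "tmp_check/log"
    traps = find_traps(log_dir)
    for name, line in traps:
        print(f"{name}:{line}")
    return 1 if traps else 0
```

Calling `main(["tmp_check/log"])` after a test run and passing its return value to `sys.exit` would turn a silent assertion failure into a failing exit status.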

So we now have three problems, not just one:

* How is it that the TAP tests aren't noticing the failure? This one,
to my mind, is a code-red situation, as it basically invalidates every
TAP test we've ever run.

* If Thomas's explanation for the timeline-switch assertion is correct,
why isn't it reproducible everywhere?

* What's with that second TRAP?

> Here is a fix for the assertion failure.

As for this patch itself, is it reasonable to try to assert that the
timeline has in fact changed?

regards, tom lane