Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date: 2017-04-21 21:04:08
Message-ID: 27895.1492808648@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> Simon Riggs wrote:
>> Replication lag tracking for walsenders
>>
>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

> Did anyone notice that this seems to be causing buildfarm member 'tern'
> to fail the recovery check? See here:

> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
> which has
> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)

> Line 3331 was added by this commit.

Note that while that commit was some time back, tern has only just started
running recovery-check, following its update to the latest buildfarm
script. It looks like it's run that test four times and failed twice,
so far. So, not 100% reproducible, but there's something rotten there.
Timing-dependent, maybe?

Some excavation in the buildfarm database says that the coverage for
the recovery-check test has been mighty darn thin up until just recently.
These are all the reports we have:

pgbfprod=> select sysname, min(snapshot) as oldest, count(*) from build_status_log where log_stage = 'recovery-check.log' group by 1 order by 2;
 sysname  |       oldest        | count
----------+---------------------+-------
 hamster  | 2016-03-01 02:34:26 |   182
 crake    | 2017-04-09 01:58:15 |    80
 nightjar | 2017-04-11 15:54:34 |    52
 longfin  | 2017-04-19 16:29:39 |     9
 hornet   | 2017-04-20 14:12:08 |     4
 mandrill | 2017-04-20 14:14:08 |     4
 sungazer | 2017-04-20 14:16:08 |     4
 tern     | 2017-04-20 14:18:08 |     4
 prion    | 2017-04-20 14:23:05 |     8
 jacana   | 2017-04-20 15:00:17 |     3
(10 rows)

So, other than hamster, which is certainly going to have its own spin
on the timing question, we have next to no track record for this test.
I wouldn't bet that this issue is unique to tern; more likely, that's
just the first critter to show an intermittent issue.

regards, tom lane
