From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Subject: | Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Date: | 2017-04-21 21:04:08 |
Message-ID: | 27895.1492808648@sss.pgh.pa.us |
Lists: | pgsql-committers pgsql-hackers |
Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> Simon Riggs wrote:
>> Replication lag tracking for walsenders
>>
>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.
> Did anyone notice that this seems to be causing buildfarm member 'tern'
> to fail the recovery check? See here:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
> which has
> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
> Line 3331 was added by this commit.
Note that while that commit was some time back, tern has only just started
running recovery-check, following its update to the latest buildfarm
script. It looks like it's run that test four times and failed twice,
so far. So, not 100% reproducible, but there's something rotten there.
Timing-dependent, maybe?
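
The trap message only gives the condition that failed, `lsn >= prev.lsn`, so the following is a minimal sketch of the kind of invariant such a check guards, not a rendering of what walsender.c actually does. The `LagTracker` class, its `record` method, and the sample values are all hypothetical. The point is simply that a tracker which assumes samples arrive in non-decreasing LSN order will run fine until, for whatever timing-related reason, one sample shows up behind its predecessor.

```python
from collections import deque


class LagTracker:
    """Hypothetical stand-in: keeps (lsn, timestamp) samples in arrival order."""

    def __init__(self):
        self.samples = deque()

    def record(self, lsn, when):
        # Same shape as the failing check: a new sample must not have a
        # smaller LSN than the one recorded just before it.
        if self.samples:
            prev_lsn, _ = self.samples[-1]
            if lsn < prev_lsn:
                raise AssertionError(
                    f"!(lsn >= prev.lsn): got {lsn} after {prev_lsn}"
                )
        self.samples.append((lsn, when))


tracker = LagTracker()
# Made-up samples; the third one arrives "behind" the second.
for lsn, when in [(100, 1.0), (250, 2.0), (200, 3.0)]:
    try:
        tracker.record(lsn, when)
    except AssertionError as exc:
        print("trap:", exc)
```

Whether the real failure comes from an ordering assumption like this one or from something else entirely is exactly what the buildfarm log does not tell us.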
Some excavation in the buildfarm database says that the coverage for
the recovery-check test has been mighty darn thin up until just recently.
These are all the reports we have:
pgbfprod=> select sysname, min(snapshot) as oldest, count(*) from build_status_log where log_stage = 'recovery-check.log' group by 1 order by 2;
 sysname  |       oldest        | count
----------+---------------------+-------
 hamster  | 2016-03-01 02:34:26 |   182
 crake    | 2017-04-09 01:58:15 |    80
 nightjar | 2017-04-11 15:54:34 |    52
 longfin  | 2017-04-19 16:29:39 |     9
 hornet   | 2017-04-20 14:12:08 |     4
 mandrill | 2017-04-20 14:14:08 |     4
 sungazer | 2017-04-20 14:16:08 |     4
 tern     | 2017-04-20 14:18:08 |     4
 prion    | 2017-04-20 14:23:05 |     8
 jacana   | 2017-04-20 15:00:17 |     3
(10 rows)
So, other than hamster which is certainly going to have its own spin
on the timing question, we have next to no track record for this test.
I wouldn't bet that this issue is unique to tern; more likely, that's
just the first critter to show an intermittent issue.
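
To put rough numbers on "next to no track record", here is a small sketch that tallies the rows from the query output above. The rows are copied from that output. The 2017-04-19 cutoff is an arbitrary choice made here to separate the animals that started reporting in the last couple of days, and the `summarize` helper is hypothetical.

```python
from datetime import datetime

# Rows copied from the query output: (sysname, oldest report, report count).
REPORTS = [
    ("hamster", "2016-03-01 02:34:26", 182),
    ("crake", "2017-04-09 01:58:15", 80),
    ("nightjar", "2017-04-11 15:54:34", 52),
    ("longfin", "2017-04-19 16:29:39", 9),
    ("hornet", "2017-04-20 14:12:08", 4),
    ("mandrill", "2017-04-20 14:14:08", 4),
    ("sungazer", "2017-04-20 14:16:08", 4),
    ("tern", "2017-04-20 14:18:08", 4),
    ("prion", "2017-04-20 14:23:05", 8),
    ("jacana", "2017-04-20 15:00:17", 3),
]


def summarize(reports, cutoff):
    """Return (animal count, report count) for animals whose oldest
    report is on or after the cutoff."""
    recent = [
        count
        for _, oldest, count in reports
        if datetime.strptime(oldest, "%Y-%m-%d %H:%M:%S") >= cutoff
    ]
    return len(recent), sum(recent)


total_runs = sum(count for _, _, count in REPORTS)
non_hamster_runs = sum(count for name, _, count in REPORTS if name != "hamster")
# Cutoff is an assumption made for illustration, not something from the thread.
new_animals, new_runs = summarize(REPORTS, datetime(2017, 4, 19))

print(f"all reports: {total_runs}, excluding hamster: {non_hamster_runs}")
print(f"animals first reporting since the cutoff: {new_animals} ({new_runs} reports)")
```

More than half of all the reports come from hamster alone, and seven of the ten animals only began reporting within the last few days.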
regards, tom lane