Recent eelpout failures on 9.x branches

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Recent eelpout failures on 9.x branches
Date: 2020-12-01 22:36:13
Message-ID: 1530182.1606862173@sss.pgh.pa.us
Lists: pgsql-hackers

For about a week, eelpout has been failing the pg_basebackup test
more often than not, but only in the 9.5 and 9.6 branches:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE

The failures all look pretty alike:

# Running: pg_basebackup -D /home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs -X stream
pg_basebackup: could not send copy-end packet: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs

What shows up in the postmaster log is

2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG: terminating walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG: disconnection: session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG: disconnection: session time: 0:01:40.933 user=tmunro database= host=[local]

The "using stale statistics" gripes seem to be from autovacuum, so they
may be unrelated to the problem; but they suggest that the system
is under very heavy load, or else that there's some kernel-level issue.
Note however that some of the failures don't have those messages, and
I also see those messages in some runs that didn't fail.
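
As a rough check on the replication timeout reported in the log, the following minimal Python sketch computes how long walsender process 29537 ran between its "replication connection authorized" entry and its termination. The two log lines are copied from the excerpt above; the helper names (parse_timestamp, elapsed_seconds) are hypothetical and exist only for this illustration. The sketch assumes the first two whitespace-separated fields of each line are the date and time, which holds for this excerpt but depends on log_line_prefix in general.

```python
from datetime import datetime

# Two entries for walsender PID 29537, copied from the postmaster log excerpt.
LOG_LINES = [
    "2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG: replication connection authorized: user=tmunro",
    "2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG: terminating walsender process due to replication timeout",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' fields of a log line.

    Assumption: the date and time are the first two whitespace-separated
    fields, as in the excerpt; other log_line_prefix settings would differ.
    """
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f")


def elapsed_seconds(first_line, last_line):
    """Return the number of seconds between two log lines."""
    delta = parse_timestamp(last_line) - parse_timestamp(first_line)
    return delta.total_seconds()


if __name__ == "__main__":
    print(f"walsender 29537 lifetime: {elapsed_seconds(*LOG_LINES):.3f} s")
```

The result is almost exactly 60 seconds, which matches the default wal_sender_timeout of 60s and agrees with the "session time: 0:01:00.008" in the disconnection entry for the same process.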

Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout. But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.

Any thoughts?

regards, tom lane
