Re: Better way of dealing with pgstat wait timeout during buildfarm runs?

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, michael(at)otacoo(dot)com, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Date: 2014-12-25 20:14:36
Message-ID: 20141225201436.GK31801@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-12-25 14:36:42 -0500, Tom Lane wrote:
> I wonder whether when multiple processes are demanding statsfile updates,
> there's some misbehavior that causes them to suck CPU away from the stats
> collector and/or convince it that it doesn't need to write anything.
> There are odd things in the logs in some of these events. For example in
> today's failure on hamster,
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2014-12-25%2016%3A00%3A07
> there are two client-visible wait-timeout warnings, one each in the
> gist and spgist tests. But in the postmaster log we find these in
> fairly close succession:
>
> [549c38ba.724d:2] WARNING: pgstat wait timeout
> [549c39b1.73e7:10] WARNING: pgstat wait timeout
> [549c38ba.724d:3] WARNING: pgstat wait timeout
>
> Correlating these with other log entries shows that the first and third
> are from the autovacuum launcher while the second is from the gist test
> session. So the spgist failure failed to get logged, and in any case the
> big picture is that we had four timeout warnings occurring in a pretty
> short span of time, in a parallel test set that's not all that demanding
> (12 parallel tests, well below our max). Not sure what to make of that.

My guess is that a checkpoint happened at that time. Maybe it'd be a
good idea to make pg_regress start postgres with log_checkpoints
enabled? I suspect we'd find horrendous 'sync' times.

Michael: Could you perhaps turn log_checkpoints on in the config?

> BTW, I notice that in the current state of pgstat.c, all the logic for
> keeping track of request arrival times is dead code, because nothing is
> actually looking at DBWriteRequest.request_time. This makes me think that
> somebody simplified away some logic we maybe should have kept. But if
> we're going to leave it like this, we could replace the DBWriteRequest
> data structure with a simple OID list and save a fair amount of code.

That's indeed odd. Seems to have been lost when the statsfile was split
into multiple files. Alvaro, Tomas?

I wondered for a second whether the split could be responsible somehow,
but there are reports of this in older back branches as well:
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=mereswine&dt=2014-12-23%2019%3A17%3A41

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
