Re: Better way of dealing with pgstat wait timeout during buildfarm runs?

From: Matt Kelly <mkellycs(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Date: 2015-01-22 03:43:03
Message-ID: CA+KcUki1DwxqbBPt8ELyVDtS4JrK=hmiEwOS5+6LqtU-MhrMdg@mail.gmail.com
Lists: pgsql-hackers

>
> Sure, but nobody who is not a developer is going to care about that.
> A typical user who sees "pgstat wait timeout", or doesn't, isn't going
> to be able to make anything at all out of that.

As a user, I wholeheartedly disagree.

That warning helped me massively in diagnosing an unhealthy database server
in the past at TripAdvisor (i.e., a high-end, server-class box, not a
Raspberry Pi). I have realtime monitoring that looks at pg_stat_database at
regular intervals, particularly at the rate of change of the xact_commit
and xact_rollback columns, similar to how check_postgres does it.
https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl#L4234
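
A minimal sketch of that kind of rate check, for illustration only. It
assumes two readings of xact_commit and xact_rollback from pg_stat_database
have already been fetched; the Sample type and tps() helper are hypothetical
names, not code from check_postgres or from my monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of pg_stat_database (hypothetical container)."""
    taken_at: float      # wall-clock seconds when the row was read
    xact_commit: int
    xact_rollback: int


def tps(prev: Sample, curr: Sample) -> float:
    """Transactions per second between two samples (commits + rollbacks)."""
    elapsed = curr.taken_at - prev.taken_at
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = ((curr.xact_commit - prev.xact_commit)
             + (curr.xact_rollback - prev.xact_rollback))
    return delta / elapsed
```

Note that this divides by the time between polls, not the time between
statistics snapshots, which is exactly why a stalled collector shows up as
0 tps followed by a spike.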

When one of those servers was unhealthy, it stopped reporting statistics
for 30+ seconds at a time. My dashboard, which polled far more frequently
than that, indicated the server was normally processing 0 tps with
intermittent spikes. I went directly onto the server and sampled
pg_stat_database. That warning was the only thing that directly indicated
that the statistics collector was not to be trusted. It obviously was a
victim of what was going on in the server, but it's pretty important to
know when your methods for measuring server health are lying to you. The
spiky TPS at first glance looks like some sort of livelock, not just an
overloaded server.
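
One way a dashboard could flag this pattern is to measure how long the
counters have gone without moving. The sketch below is an assumption-laden
illustration: the function names and the 5-second threshold are made up,
and a flat counter can also mean a genuinely idle server, so the result is
a hint to go look, not proof that the collector is stuck.

```python
def seconds_without_change(samples):
    """How long the newest samples have shown an unchanged counter.

    samples: list of (taken_at_seconds, xact_commit + xact_rollback),
    ordered oldest first.
    """
    if not samples:
        return 0.0
    last_time, last_total = samples[-1]
    flat_since = last_time
    # Walk backwards until the counter differs from the newest value.
    for taken_at, total in reversed(samples[:-1]):
        if total != last_total:
            break
        flat_since = taken_at
    return float(last_time - flat_since)


def stats_look_stale(samples, max_flat_seconds=5.0):
    """True if the counters have been flat for at least max_flat_seconds."""
    return seconds_without_change(samples) >= max_flat_seconds
```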

Now, I know: 0 change in stats = collector broken. Rereading the docs:

> Also, the collector itself emits a new report at most once per
> PGSTAT_STAT_INTERVAL milliseconds (500 ms unless altered while building
> the server).

Without context this merely reads: "We sleep for 500ms, plus the time to
write the file, plus whenever the OS decides to enforce the timer
interrupt... so like 550-650ms." It doesn't read, "When the server is
unhealthy, but _still_ serving queries, the stats collector might not be
able to keep up and will just stop reporting stats altogether."

I think the warning is incredibly valuable. Along those lines I'd also
love to see a pg_stat_snapshot_timestamp() for monitoring code to use to
determine if it's using a stale snapshot, as well as to help smooth
graphs of the statistics that are based on highly granular snapshotting.
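
To illustrate how monitoring code might use such a timestamp: the sketch
below assumes the snapshot time has already been obtained somehow (for
example from the proposed pg_stat_snapshot_timestamp(), which is only a
suggestion in this message) and is passed in as a plain datetime. The
helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def snapshot_age(snapshot_ts: datetime, now: datetime) -> timedelta:
    """How old the statistics snapshot is at the time of the poll."""
    return now - snapshot_ts


def rate_between_snapshots(prev_ts, prev_total, curr_ts, curr_total):
    """Transactions per second between two distinct snapshots.

    Returns None when the snapshot did not advance, so the caller can
    skip the data point instead of plotting a bogus 0 tps.
    """
    elapsed = (curr_ts - prev_ts).total_seconds()
    if elapsed <= 0:
        return None
    return (curr_total - prev_total) / elapsed
```

Dividing by the time between snapshots rather than the time between polls
is what smooths the graph: a 30-second stall becomes one data point at the
true average rate instead of a run of zeros followed by a spike.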

- Matt Kelly
