Re: last_statrequest is in the future

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org, ohp(at)pyrenet(dot)fr
Subject: Re: last_statrequest is in the future
Date: 2010-03-25 16:45:40
Message-ID: 19946.1269535540@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> ... lookee what we have here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-24%2004:00:07

Modifying the code to print the actual timestamps resulted in a wealth
of information that can be dredged from the postmaster log here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-25%2004:00:04
Just grep for log messages like
[4baafb13.cd2:1] LOG: last_statrequest 2010-03-25 06:58:42.118734+01 is later than collector's time 2010-03-25 06:58:40.817027+01

What this shows is that system timekeeping on jaguar is just utterly
broken. gettimeofday() sometimes returns answers that are 5 seconds or
more skewed in different processes, and there are episodes where its
value fails to move at all in the stats collector while multiple seconds
elapse in the opinion of another process.

I wonder whether this doesn't also explain jaguar's propensity to show
some other unexplainable failures, like these recent examples:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-20%2004:00:03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-15%2004:00:04
which are not at all unique in that machine's buildfarm history.

I had always supposed that these were caused by the extreme overall
slowness induced by CLOBBER_CACHE_ALWAYS, but now I think they are
more likely explained by system timekeeping issues. In particular the
last-cited regression failure can be explained if now() fails to change
value across pg_sleep(0.1), and that's isomorphic to what seems to be
happening in the stats collector in some of the episodes captured today.

In short: either this machine has got broken clock hardware, or there's
a very serious bug in the kernel code for reading the clock.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2010-03-25 17:41:34 Conditional trigger missing from pg_dump
Previous Message Alvaro Herrera 2010-03-25 15:11:38 Re: ALTER ROLE/DATABASE RESET ALL versus security