Re: Problem with pgstat timeouts

From: "Benjamin Krajmalnik" <kraj(at)servoyant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Problem with pgstat timeouts
Date: 2011-12-22 18:25:46
Message-ID: F4E6A2751A2823418A21D4A160B689888CA5AE@fletch.stackdump.local
Lists: pgsql-admin

Thanks, Tom.
The system is running 16 cores (dual E5620's), 96GB RAM.
Nothing changed on the system when this started happening - at least
nothing in the configuration - it has not been touched since it was set
up.
Load averages (1-, 5-, and 15-minute) are all between 6 and 8.
Memory-wise, I do not see a problem - plenty of free memory, and only
120K of swap has been used.
The only thing I see when I run iostat is load on the drives every so
often. mfid0 holds the database, mfid1 holds the pg_xlog.

db1# iostat -x 5
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 1.6 0
ad6 0.0 0.0 0.0 0.0 0 0.8 0
mfid0 11.6 172.2 171.7 2988.5 0 178.1 6
mfid1 0.1 734.0 1.1 5898.8 0 0.2 4
ar0 0.3 0.7 1.0 10.3 0 114.7 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 2.8 73.4 22.4 1961.9 0 0.2 2
mfid1 0.0 1784.9 0.0 14303.2 0 0.0 5
ar0 0.0 1.0 0.0 6.8 0 0.8 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 2.2 652.7 44.8 7671.1 2053 99.6 10
mfid1 0.0 1525.9 0.0 12207.6 1 0.1 12
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 2202.4 8.0 19776.3 0 632.0 68
mfid1 0.0 574.3 0.0 4666.4 0 0.5 23
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.8 75.8 14.4 2242.8 0 0.3 2
mfid1 0.0 1722.5 0.0 13802.0 0 0.0 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.4 71.4 11.2 1991.6 0 0.2 1
mfid1 0.0 1708.9 0.0 13671.4 0 0.0 4
ar0 0.2 2.4 3.2 38.4 0 2.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 78.0 8.0 2222.5 0 0.1 1
mfid1 0.0 1541.0 0.0 12352.1 0 0.0 5
ar0 0.0 0.6 0.0 9.6 0 0.6 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.4 56.6 11.2 1629.1 0 0.3 2
mfid1 0.0 1558.3 0.0 12468.3 0 0.0 6
ar0 0.0 0.6 0.0 3.2 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.8 72.2 14.4 1706.7 0 0.2 1
mfid1 0.0 1696.5 0.0 13596.2 0 0.0 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 1757.7 0.0 16093.5 351 648.0 48
mfid1 0.0 503.9 0.0 4031.3 0 0.2 10
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 125.0 0.0 2542.5 0 404.4 1
mfid1 0.0 1743.9 0.0 13951.1 0 0.1 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.2 73.4 9.6 2122.6 0 0.1 1
mfid1 0.0 1624.3 0.0 13020.0 0 0.0 4
ar0 0.0 2.0 0.0 32.0 0 1.8 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.2 67.4 1.6 1955.8 0 0.1 0
mfid1 0.0 1334.5 0.0 10699.7 0 0.0 5
ar0 0.0 0.4 0.0 6.4 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 64.6 8.0 1789.1 0 0.2 1
mfid1 0.0 1541.0 0.0 12352.1 0 0.0 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 78.4 8.0 1855.3 0 0.2 1
mfid1 0.0 1652.1 0.0 13217.0 0 0.0 5
ar0 0.0 0.8 0.0 6.4 0 0.7 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.6 691.6 4.8 7919.5 3461 82.4 11
mfid1 0.0 1477.7 0.0 11821.7 1 0.3 15
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.4 1346.2 3.2 12558.1 0 855.1 41
mfid1 0.0 1020.8 0.0 8190.6 0 0.5 39
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 62.2 8.0 1792.6 0 0.3 2
mfid1 0.0 1636.3 0.0 13112.4 0 0.0 4
ar0 0.0 0.8 0.0 12.8 0 0.8 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.2 63.0 1.6 1818.1 0 0.1 0
mfid1 0.0 1710.6 0.0 13684.6 0 0.0 5
ar0 0.0 1.2 0.0 19.2 0 1.2 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.8 83.8 6.4 2330.6 0 0.1 1
mfid1 0.0 1218.8 0.0 9750.2 0 0.0 4
ar0 0.0 0.4 0.0 6.4 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 68.6 0.0 1655.7 0 0.1 0
mfid1 0.0 1708.9 0.0 13671.4 0 0.0 5
ar0 0.0 0.6 0.0 3.2 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 71.6 0.0 1999.4 0 0.0 0
mfid1 0.0 1071.3 0.0 8594.1 0 0.1 3
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.4 1373.6 3.2 12956.5 0 550.7 35
mfid1 0.0 999.8 0.0 8022.7 0 0.8 37
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.8 68.8 8.0 1908.5 0 0.2 1
mfid1 0.0 1370.5 0.0 10962.3 0 0.0 3
ar0 0.0 0.6 0.0 9.6 0 0.6 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.2 63.8 9.6 1912.1 0 0.3 2
mfid1 0.0 1355.8 0.0 10846.2 0 0.0 3
ar0 0.0 1.0 0.0 16.0 0 1.1 0

Looking at /var/log/messages, everything appears to be OK.
The only thing I found, and it was from over a month ago, was the
following:
Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707
But it has not recurred since.

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Thursday, December 22, 2011 10:55 AM
> To: Benjamin Krajmalnik
> Cc: pgsql-admin
> Subject: Re: [ADMIN] Problem with pgstat timeouts
>
> "Benjamin Krajmalnik" <kraj(at)servoyant(dot)com> writes:
> > About a month ago, I started receiving quite a few pgstat timeouts on my
> > production database.
>
> > PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
> > 4.2.1 20070719 [FreeBSD], 64-bit
>
> > I am not sure where to start looking for the cause.
>
> Yeah, so are the rest of us :-(. People have been reporting sometimes
> seeing these messages without any clear cause. We've been unable to
> find any bug in Postgres that would explain them, and the obvious
> explanation of "your system is so overloaded that the stats collector
> is too slow to respond" isn't very satisfying if response time seems
> snappy otherwise. I think we have seen one case where the problem was
> provably caused by significant (multi-second) clock skew between
> different processors, but that's likely not applicable to many people.
>
> Can you reconstruct what changed in your system around the time you
> started seeing these?
>
> > It appears that these timeouts are affecting the performance of the
> > application - sometimes it hangs for a few seconds before data gets
> > returned, which was not the case before this.
>
> Usually these timeouts are reported by autovacuum, and so should not be
> affecting the performance of foreground queries, at least not queries
> that are not explicitly requesting pgstats data. The fact that your
> log messages don't include any query context reinforces the idea that
> they are coming from autovacuum for you too. So if you are seeing
> query response time problems, maybe the pgstat timeouts are another
> symptom, not the cause, of an underlying too-much-load problem. You
> might try turning on query duration logging and see if you can correlate
> response-time glitches and pgstat timeouts to anything else observable,
> such as peaks of system load average.
>
> regards, tom lane
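
For reference, the query duration logging Tom suggests is controlled from
postgresql.conf; a minimal sketch (the 250 ms threshold and prefix format
are only illustrative values, not recommendations):

  log_min_duration_statement = 250   # log any statement running longer than 250 ms
  log_line_prefix = '%t [%p] '       # timestamp and backend PID, to correlate with load spikes

Both settings can be picked up with a "pg_ctl reload" (or SELECT
pg_reload_conf()) without restarting the server.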
