Merlin Moncure wrote:
>On Tue, Oct 6, 2009 at 10:59 AM, Michal Vitecek <fuf(at)mageo(dot)cz> wrote:
>> Merlin Moncure wrote:
>>>On Mon, Oct 5, 2009 at 5:17 AM, Michal Vitecek <fuf(at)mageo(dot)cz> wrote:
>>>> Could the problem be the HW RAID card? There's ServerRAID 8k with 256MB
>>>> with write-back enabled. Could it be that its internal cache becomes
>>>> full and all disk I/O operations are delayed until it writes all
>>>> changes to hard drives?
>>>that's possible...the red flag is going to be iowait. if your server
>>>can't keep up with the sync demands for example, you will eventually
>>>outrun the write cache and you can start to see slow queries. With
>>>your server though it would take in the hundreds of (write)
>>>transactions per second to do that minimum.
>> The problem is that the server is not loaded in any way. The iowait is
>> 0.62%, there's only 72 sectors written/s, but the maximum await that I
>> saw was 28ms (!). Any attempts to reduce the time (I/O schedulers,
>> disabling bgwriter, increasing number of checkpoints, decreasing shared
>> buffers, disabling read cache on the card etc.) didn't help. After some
>> 3-5m there occurs a COMMIT which takes 100-10000x longer time than
>> usual. Setting fsynch to off Temporarily improved the COMMIT times
>> considerably but I fear to have this option off all the time.
>> Is anybody else using the same RAID card? I suspect the problem lies
>> somewhere between the aacraid module and the card. The aacraid module
>> ignores setting of the 'cache' parameter to 3 -- this should completely
>> disable the SYNCHRONIZE_CACHE command.
>I think you're right. One thing you can do is leave fsync on but
>disable synchronous_commit. This is compromise between fsync on/off
>(data consistent following crash, but you may lose some transactions).
>We need to know what iowait is at the precise moment you get the long
>commit time. Throw a top, give it short update interval (like .25
>seconds), and watch.
I'm writing with resolution to the problem: It was indeed caused by the
IBM ServerRAID 8k SAS RAID card. Putting the WAL logs onto a separate
(not in RAID 5) hard drive helped tremendously with the COMMIT times
and the occurrence of the very long SQL query times dropped from 3-5min
to ~45min where only INSERT or UPDATE queries were slow. Flashing
firmware of the RAID card and of all hard drives fixed the problem
altogether. To explain why we waited for so long with the firmware
updates was because of the fact that IBM frequently puts a new version
on their servers and then, after a day or two, replaces it with a newer
version which fixes a critical bug introduced in the previous one.
In response to
pgsql-performance by date
|Next:||From: S Arvind||Date: 2009-10-12 11:21:27|
|Subject: Query performance|
|Previous:||From: Pavel Stehule||Date: 2009-10-12 08:57:34|
|Subject: Re: Using unnest function on multi-dimensional array.|