Re: strange pgbench results (as if blocked at the end)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: strange pgbench results (as if blocked at the end)
Date: 2011-08-13 03:09:07
Message-ID: 4E45EAD3.9030102@2ndQuadrant.com
Lists: pgsql-performance

On 08/12/2011 07:37 PM, Tomas Vondra wrote:
> I've run nearly 200 of these, and in about 10 cases I got something that
> looks like this:
>
> http://www.fuzzy.cz/tmp/pgbench/tps.png
> http://www.fuzzy.cz/tmp/pgbench/latency.png
>
> i.e. it runs just fine for about 3:40 and then something goes wrong. The
> bench should take 5:00 minutes, but it somehow locks, does nothing for
> about 2 minutes and then all the clients end at the same time. So instead
> of 5 minutes the run actually takes about 6:40.
>

You need to run tests like these for 10 minutes to see the full cycle of
things; then you'll likely see these pauses on most runs, instead of on
only 5% of them. It's probably the case that some of your tests are
finishing before the first checkpoint does, which is why you don't see
the bad stuff every time.
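
A minimal sketch of what a 10-minute run could look like when driven from
a script. The database name "bench" and the client count are placeholders,
not values from this thread; -c, -T and -l are the standard pgbench
options for client count, run time in seconds, and per-transaction
logging.

```python
def build_pgbench_cmd(dbname, clients, seconds=600):
    """Build a pgbench argument list for a timed run.

    dbname and clients are illustrative placeholders; seconds defaults
    to 600 so the run spans the 10 minutes suggested above.
    """
    return [
        "pgbench",
        "-c", str(clients),   # number of concurrent clients
        "-T", str(seconds),   # run for a fixed time instead of a fixed count
        "-l",                 # write a per-transaction log for later analysis
        dbname,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(" ".join(build_pgbench_cmd("bench", clients=8)))
```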

The long pauses are most likely every client blocking once the
checkpoint sync runs. When those fsync calls go out, Linux on ext3 can
freeze for quite a while. In your graphs, the drop in TPS/rise
in latency at around 50:30 is either the beginning of a checkpoint or
the dirty_background_ratio threshold in Linux being exceeded; they tend
to happen around the same time. The checkpoint executes its write phase
for a bit, then gets into the sync phase around 51:40. You can find a
couple of examples just like this in my giant test set, built around what
was committed as the fsync compaction feature in 9.1, all at
http://www.2ndquadrant.us/pgbench-results/index.htm
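
One way to confirm a freeze like this from the numbers rather than the
graphs is to look for long gaps between transaction completion times. The
sketch below assumes you have already extracted those times, in seconds,
from your own logs; the function name and the threshold are made up for
illustration.

```python
def find_stalls(timestamps, min_gap=5.0):
    """Return (start, end) pairs where no transaction completed for at
    least min_gap seconds.

    timestamps: completion times in seconds, in any order.
    min_gap: hypothetical threshold; pick one that fits your workload.
    """
    ordered = sorted(timestamps)
    stalls = []
    # Compare each completion time with the one that follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev >= min_gap:
            stalls.append((prev, cur))
    return stalls


if __name__ == "__main__":
    # Synthetic data: steady activity, then a two-minute hole.
    sample = [0.0, 1.0, 2.0, 122.0, 123.0]
    print(find_stalls(sample, min_gap=60.0))
```

A stall that every client shares, lining up with a checkpoint's sync
phase, would be consistent with the explanation above.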

The one most similar to your case is
http://www.2ndquadrant.us/pgbench-results/481/index.html
Had that test only run for 5 minutes, it would have looked just like
yours, ending after the long pause that's in the middle of my run. The
freeze was over 3 minutes long in that example. (My server has a fairly
fast disk subsystem, probably faster than what you're testing, but it
also has 8GB of RAM that it can dirty, which more than makes up for the
faster disks.)
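
To put rough numbers on why more RAM means a longer freeze, here is a
back-of-the-envelope sketch. The 20% ratio and the 10 MB/s write rate are
assumptions chosen for illustration, not measurements from either server,
and the kernel really applies its ratios to dirtyable memory rather than
total RAM, so treat the result as an upper-bound estimate only.

```python
def dirty_limit_bytes(ram_bytes, ratio_percent):
    """Approximate how much memory may be dirty at a given ratio.

    Simplification: uses total RAM, while the kernel uses dirtyable memory.
    """
    return ram_bytes * ratio_percent // 100


def flush_seconds(dirty_bytes, write_bytes_per_sec):
    """Time to write out dirty_bytes at a sustained write rate."""
    return dirty_bytes / write_bytes_per_sec


if __name__ == "__main__":
    ram = 8 * 1024**3                    # 8GB of RAM, as in the server above
    limit = dirty_limit_bytes(ram, 20)   # assumed 20% ratio
    # Assumed 10 MB/s effective rate for scattered database writes.
    print(limit, flush_seconds(limit, 10 * 1000**2))
```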

In my tests, I switched from ext3 to XFS to get better behavior. You
got the same sort of benefit from ext4. ext3 just doesn't cope well with
its write cache filling up and fsync calls then executing. I've
given up on that as an unsolvable problem; improving behavior on XFS and
ext4 is the only problem worth worrying about now, as far as I'm concerned.

And I keep seeing too many data corruption issues on ext4 to recommend
anyone use it yet for PostgreSQL; that's why I focused on XFS. ext4
still needs at least a few more months before all the bug fixes it's
gotten in later kernels are backported to the 2.6.32 versions deployed
in RHEL6 and Debian Squeeze, the newest Linux distributions my customers
care about right now. On RHEL6 for example, go read
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.1_Technical_Notes/kernel.html
(specifically BZ#635199), and you tell me if that sounds like it's
considered stable code yet or not. "The block layer will be updated in
future kernels to provide this more efficient mechanism of ensuring
ordering...these future block layer improvements will change some kernel
interfaces..." Yikes, that does not inspire confidence in me.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
