Strange behavior: pgbench and new Linux kernels

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Strange behavior: pgbench and new Linux kernels
Date: 2008-04-17 07:58:43
Message-ID: Pine.GSO.4.64.0804170230180.26917@westnet.com
Lists: pgsql-performance

This week I've finished building and installing OSes on some new hardware
at home. I have a pretty standard validation routine I go through to make
sure PostgreSQL performance is good on any new system I work with. Found
a really strange behavior this time around that seems related to changes
in Linux. Don't expect any help here, but if someone wanted to replicate
my tests I'd be curious to see if that can be done. I tell the story
mostly because I think it's an interesting tale in hardware and software
validation paranoia, but there's a serious warning here as well for Linux
PostgreSQL users.

The motherboard is fairly new, and I couldn't get CentOS 5.1, which ships
with kernel 2.6.18, to install with the default settings. I had to drop
back to "legacy IDE" mode to install. But it was running everything in
old-school IDE mode, no DMA or anything. "hdparm -Tt" showed a whopping
3MB/s on reads.
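
For the record, that's just the quick hdparm sanity check against whatever
drive the data will live on; /dev/sda below is only an example device name:

hdparm -Tt /dev/sda    # -T times cached reads, -t times buffered disk reads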

I pulled down the latest (at the time--only a few hours and I'm already
behind) Linux kernel, 2.6.24-4, and compiled that with the right modules
included. Now I'm getting 70MB/s on simple reads. Everything looked fine
from there until I got to the pgbench select-only tests running PG 8.2.7
(I do 8.2 then 8.3 separately because the checkpoint behavior on
write-heavy stuff is so different and I want to see both results).

Here's the regular thing I do to see how fast pgbench executes against
things in memory (but bigger than the CPU's cache):

-Set shared_buffers=256MB, start the server
-dropdb pgbench (if it's already there)
-createdb pgbench
-pgbench -i -s 10 pgbench (makes about a 160MB database)
-pgbench -S -c <2*cores> -t 10000 pgbench
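
If you want to script that, the whole routine is basically this (it assumes
the server is already running with shared_buffers=256MB and you can connect
as the current user; the database name and scale are just what I happen to use):

#!/bin/sh
dropdb pgbench 2>/dev/null        # ignore the error if it doesn't exist yet
createdb pgbench
pgbench -i -s 10 pgbench          # initialize, about 160MB of data
pgbench -S -c 8 -t 10000 pgbench  # select-only, 2 clients per core on a quad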

Since the database was just written out, the whole thing will still be in
the shared_buffers cache, so this should execute really fast. This was an
Intel quad-core system, I used -c 8, and that got me around 25K
transactions/second. Curious to see how high I could push this, I started
stepping up the number of clients.
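
The stepping up is nothing fancy, just a shell loop over client counts and
grepping out the line I care about; the counts here are only the ones I was
curious about:

for c in 8 9 10 11 12 16 32; do
    echo "clients=$c"
    pgbench -S -c $c -t 10000 pgbench | grep excluding
done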

That's where the weird thing happened. Just by going to 12 clients
instead of 8, I dropped to 8.5K TPS, about 1/3 of what I get from 8
clients. It was like that on every test run. When I use 10 clients, it's
about 50/50; sometimes I get 25K, sometimes 8.5K. The only thing it
seemed to correlate with is that vmstat on the 25K runs showed ~60K
context switches/second, while the 8.5K ones had ~44K.
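
Those context switch numbers come from nothing fancier than running vmstat in
another terminal while the test runs and eyeballing the cs column:

vmstat 1      # one line per second; cs = context switches/sec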

Since I've never seen this before, I went back to my old benchmark system
with a dual-core AMD processor. That started with CentOS 4 and kernel
2.6.9, but I happened to install kernel 2.6.24-3 on there to get better
support for my Areca card (it goes bonkers regularly on x64 2.6.9).
Never did a thorough performance test of the new kernel though. Sure
enough, the same behavior was there, except without a flip-flop point,
just a sharp decline. Check this out:

-bash-3.00$ pgbench -S -c 8 -t 10000 pgbench | grep excluding
tps = 15787.684067 (excluding connections establishing)
tps = 15551.963484 (excluding connections establishing)
tps = 14904.218043 (excluding connections establishing)
tps = 15330.519289 (excluding connections establishing)
tps = 15606.683484 (excluding connections establishing)

-bash-3.00$ pgbench -S -c 12 -t 10000 pgbench | grep excluding
tps = 7593.572749 (excluding connections establishing)
tps = 7870.053868 (excluding connections establishing)
tps = 7714.047956 (excluding connections establishing)

Results are consistent, right? Summarizing that and extending out, here's
what the median TPS numbers look like with 3 tests at each client load:

-c4: 16621 (increased -t to 20000 here)
-c8: 15551 (all these with t=10000)
-c9: 13269
-c10: 10832
-c11: 8993
-c12: 7714
-c16: 7311
-c32: 7141 (cut -t to 5000 here)
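
Pulling the median out of three runs is easy enough to do in the shell; this
sketch runs one client count three times and prints the middle tps value:

for i in 1 2 3; do
    pgbench -S -c 12 -t 10000 pgbench
done | grep excluding | awk '{print $3}' | sort -n | head -2 | tail -1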

Now, somewhere around here I start thinking about CPU cache coherency, I
play with forcing tasks to particular CPUs, I try the deadline scheduler
instead of the default CFQ, but nothing makes a difference.
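
For reference, the sort of thing I was trying looked like this (sda and the
CPU numbers are just examples, and the scheduler switch needs root):

echo deadline > /sys/block/sda/queue/scheduler   # was cfq
taskset -c 0,1 pgbench -S -c 8 -t 10000 pgbench  # pin the client to two cores
# (the postmaster can be pinned the same way with taskset -pc 0,1 <pid>)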

Wanna guess what did? An earlier kernel. These results are the same test
as above, same hardware, only difference is I used the standard CentOS 4
2.6.9-67.0.4 kernel instead of 2.6.24-3.

-c4: 18388
-c8: 15760
-c9: 15814 (one result of 12623)
-c12: 14339 (one result of 11105)
-c16: 14148
-c32: 13647 (one result of 10062)

We get the usual bit of pgbench flakiness, but using the earlier kernel is
faster in every case, only degrades slowly as clients increase, and is
almost twice as fast here in a typical high-client load case.

So in the case of this simple benchmark, I see an enormous performance
regression from the newest Linux kernel compared to a much older one. I
need to do some version bisection to nail it down for sure, but my guess
is it's the change to the Completely Fair Scheduler in 2.6.23 that's to
blame. The recent FreeBSD 7.0 PostgreSQL benchmarks at
http://people.freebsd.org/~kris/scaling/7.0%20and%20beyond.pdf showed an
equally brutal performance drop going from 2.6.22 to 2.6.23 (see page 16)
in around the same client load on a read-only test. My initial guess is
that I'm getting nailed by a similar issue here.
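
The bisection, when I get to it, will be the standard kernel git routine,
assuming the regression really did land between 2.6.22 and 2.6.23 (each step
means building, installing, and booting that kernel before re-running pgbench):

git bisect start
git bisect bad v2.6.23     # first version I suspect has the slowdown
git bisect good v2.6.22    # last one the FreeBSD results suggest was fine
# build and boot the revision git checks out, run the pgbench test, then
# "git bisect good" or "git bisect bad" until it converges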

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
