From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial 9.2 pgbench write results
Date: 2012-02-23 11:17:32
Message-ID: 4F46204C.1000501@2ndQuadrant.com
Lists: pgsql-hackers
I've updated http://highperfpostgres.com/results-write-9.2-cf4/index.htm
with more data, including two alternate background writer
configurations.  Since the sensitive part of the original results was
scales of 500 and 1000, I've also gone back and added scale=750 runs to
all of the result sets.  Quick summary:  I'm not worried about 9.2
performance now.  I'm increasingly confident that the earlier problems
I reported are just bad interactions between the reinvigorated
background writer and workloads that are tough to write to disk.  I'm
satisfied that I understand these test results well enough to start
evaluating the pending 9.2 changes in the CF queue that I wanted to
benchmark.

Attached are now-useful client and scale graphs.  All of 9.0, 9.1, and
9.2 have now been run with exactly the same scales and client loads, so
the graphs of all three versions can be compared.  The two 9.2
variations with alternate parameters were only run at some scales,
which means you can't compare them usefully on the clients graph, only
on the scaling one.  They are very obviously in a whole different range
on the clients graph; just ignore the two sets that are way below the
rest.

Here's a repeat of the interesting parts of the data set with the new
points.  Here "9.2N" is 9.2 with no background writer, while "9.2H" has
the background writer set to half strength:  bgwriter_lru_maxpages = 50
(the exact settings are sketched out after the tables).  I picked one
middle client level out of the scale=750 results just to focus better;
the relative results are not sensitive to that choice:

scale=500, db is 46% of RAM
Version Avg TPS
9.0  1961
9.1  2255
9.2  2525
9.2N 2267
9.2H 2300

scale=750, db is 69% of RAM; clients=16
Version Avg TPS
9.0  1298
9.1  1387
9.2  1477
9.2N 1489
9.2H 943

scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.2% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
9.2N 516
9.2H 400
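
To spell out the two variations, this is roughly the only
postgresql.conf change involved; everything else matches the stock 9.2
runs, where the default for this setting is 100:

  # "9.2N":  LRU background writing turned off
  bgwriter_lru_maxpages = 0

  # "9.2H":  half of the default
  bgwriter_lru_maxpages = 50

Both take effect with a plain reload; this setting doesn't require a
server restart.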

The fact that almost all of the performance regression in 9.2 goes away
if the background writer is disabled is an interesting point.  That
results actually get worse at scale=500 without the background writer
is another.  That pair of observations makes me feel better:  there's a
tuning trade-off being made implicitly by having a more active
background writer in 9.2; it helps in some cases and hurts in others.
That I can deal with.  Everything lines up perfectly at scale=500 if I
reorder by TPS:

scale=500, db is 46% of RAM
Version Avg TPS
9.2  2525
9.2H 2300
9.2N 2267
9.1  2255
9.0  1961

That makes you want to say "the more background writer the better", right?

The new scale=750 numbers are weird though, and they keep this from
being so clear.  I ran the parts that seemed most odd twice just
because they looked so strange, and they were repeatable.  Just as at
scale=500, a 9.2 configuration has the best performance of any run at
scale=750, this time the one with no background writer.  But the
half-intensity one has the worst!  It would be nice if it fell between
the 9.2 and 9.2N results; instead it's at the other edge.

The only lesson I can think to draw here is that once we're in
territory where performance is dominated by the trivia of exactly how
writes are scheduled, the optimal ordering of writes is just too
complicated to model easily.  The rest of this is all speculation about
how to fit some ideas to this data.

Going back to 8.3 development, one of the design trade-offs I was very
concerned about was not wasting resources by having the BGW run too
often.  Even then it was clear that for these simple pgbench tests,
there were situations where letting backends do their own writes was
better than speculative writes from the background writer.  The BGW
constantly risks writing a page that will be re-dirtied before it goes
to disk.  That can't be too common in the current design though, since
it avoids buffers with high usage counts.  (The original BGW wrote
buffers that had been used recently, and that was a measurable problem
by 8.3.)
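
If you want to see the pool the cleaner works from, the contrib
pg_buffercache module shows it directly; the dirty buffers with a zero
usage count are essentially the only ones the LRU writer will consider.
Something like this works (the database name is just a placeholder, and
the query isn't free on a large shared_buffers, so don't run it
constantly during a timed test):

  # requires the contrib module:  CREATE EXTENSION pg_buffercache;
  psql -d pgbench -c "
    SELECT usagecount, isdirty, count(*)
      FROM pg_buffercache
     GROUP BY 1, 2
     ORDER BY 1, 2;"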

I think an even bigger factor now is that BGW writes can disturb the
write ordering/combining done at the kernel and storage levels.  It's
painfully obvious now how much PostgreSQL relies on that to get good
performance.  All sorts of things break badly if random writes aren't
being scheduled in a way that optimizes seek times, in as many contexts
as possible.  It doesn't seem unreasonable that background writer
writes can introduce some delay into the checkpoint writes, just by
adding more random components to what is already a difficult write/sync
series to handle.  I think that's what these results are showing:
background writer writes can de-optimize other forms of write.
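
To be concrete about which kernel pieces I mean, these are the main
knobs on the Linux side (substitute whatever block device the database
and WAL actually sit on for sda):

  cat /sys/block/sda/queue/scheduler        # elevator doing the request sorting/merging
  cat /proc/sys/vm/dirty_background_ratio   # how much dirty data before background writeback starts
  cat /proc/sys/vm/dirty_ratio              # how much before writers get throttled
  cat /proc/sys/vm/dirty_expire_centisecs   # how old dirty data can get before it's pushed out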

A second fact that's visible in the TPS graphs over each test run, and
obvious if you think about it, is that BGW writes push data to physical
disk earlier than it might otherwise get there.  It's a subtle pattern
in the graphs, but I expect it, given that one element of the "should I
write this now?" decision in Linux is how old the dirty data is.
Wondering about this really emphasises that I need to either add
graphing of vmstat/iostat data to these results or switch to a
benchmark that already does that.  I think I've got just enough people
using pgbench-tools to justify the feature even if I plan to use the
program less myself.
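
Collecting that by hand is simple enough; something like this is all
the tool would need to automate (the client count and 10 minute
duration match these runs, while the thread count and database name are
just examples):

  vmstat 1 > vmstat.log &
  iostat -x 1 > iostat.log &
  pgbench -c 16 -j 4 -T 600 pgbench
  kill %1 %2    # stop the collectors; save the PIDs instead when scripting this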

I also have a good answer now for "why does this only happen at these
scales?"  At scales below this range, the database is so small relative
to RAM that it all fits, all the time.  That includes the indexes being
very small, so their dirty blocks don't generate many writes.  At
higher scales, the write volume becomes seek bound, and throughput
drops so low that checkpoints become timeout driven, so there are
significantly fewer of them.  At the largest scales and client counts
here, some of these 10-minute runs don't finish a single checkpoint:
one doesn't even start until 5 minutes in, and the checkpoint writes
are so slow that they take longer than 5 minutes to trickle out and
sync, with all the competing I/O from backends mixed in.  Note that the
"clients-sets" graph still shows a strong jump from 9.0 to 9.1 at high
client counts; I'm pretty sure that's the fsync compaction at work.
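
An easy way to double-check the checkpoint accounting without full
graphing is the standard instrumentation:  turn on log_checkpoints and
diff pg_stat_bgwriter snapshots taken before and after each run, which
splits the writes between checkpoints, the background writer, and the
backends themselves.  Roughly, assuming $PGDATA points at the data
directory:

  echo "log_checkpoints = on" >> $PGDATA/postgresql.conf
  pg_ctl -D $PGDATA reload

  # snapshot before and after a run, then subtract
  psql -d pgbench -c "
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
      FROM pg_stat_bgwriter;"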

-- 
Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment: clients-sets.png (image/png, 6.0 KB)
Attachment: scaling-sets.png (image/png, 5.6 KB)
