Skip site navigation (1) Skip section navigation (2)

Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Gregory Smith <gregsmithpgsql(at)gmail(dot)com>
To: Mel Gorman <mgorman(at)suse(dot)de>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 20:24:01
Message-ID: 52D99161.60305@gmail.com (view raw or flat)
Thread:
Lists: pgsql-hackers
On 1/17/14 10:37 AM, Mel Gorman wrote:
> There is not an easy way to tell. To be 100%, it would require an 
> instrumentation patch or a systemtap script to detect when a 
> particular page is being written back and track the context. There are 
> approximations though. Monitor nr_dirty pages over time.

I have a benchmarking wrapper for the pgbench testing program called 
pgbench-tools:  https://github.com/gregs1104/pgbench-tools  As of 
October, on Linux it now plots the "Dirty" value from /proc/meminfo over 
time.  You get that on the same time axis as the transaction latency 
data.  The report at the end includes things like the maximum amount of 
dirty memory observed during the test sampling. That doesn't tell you 
exactly what's happening to the level someone reworking the kernel logic 
might want, but you can easily see things like the database's checkpoint 
cycle reflected by watching the dirty memory total.  This works really 
well for monitoring production servers too.  I have a lot of data from a 
plugin for the Munin monitoring system that plots the same way.  Once 
you have some history about what's normal, it's easy to see when systems 
fall behind in a way that's ruining writes, and the high water mark 
often correlates with bad responsiveness periods.

Another recent change is that pgbench for the upcoming PostgreSQL 9.4 
now allows you to specify a target transaction rate.  Seeing the write 
latency behavior with that in place is far more interesting than 
anything we were able to watch with pgbench before.  The pgbench write 
tests we've been doing for years mainly told you the throughput rate 
when all of the caches were always as full as the database could make 
them, and tuning for that is not very useful. Turns out it's far more 
interesting to run at 50% of what the storage is capable of, then watch 
what happens to latency when you adjust things like the dirty_* parameters.

I've been working on the problem of how we can make a benchmark test 
case that acts enough like real busy PostgreSQL servers that we can 
share it with kernel developers, and then everyone has an objective way 
to measure changes.  These rate limited tests are working much better 
for that than anything I came up with before.

I am skeptical that the database will take over very much of this work 
and perform better than the Linux kernel does.  My take is that our most 
useful role would be providing test cases kernel developers can add to a 
performance regression suite.  Ugly "we never though that would happen" 
situations seems at the root of many of the kernel performance 
regressions people here get nailed by.

Effective I/O scheduling is very hard, and we are unlikely to ever out 
innovate the kernel hacking community by pulling more of that into the 
database.  It's already possible to experiment with moving in that 
direction with tuning changes.  Use a larger database shared_buffers 
value, tweak checkpoints to spread I/O out, and reduce things like 
dirty_ratio.  I do some of that, but I've learned it's dangerous to 
wander too far that way.

If instead you let Linux do even more work--give it a lot of memory to 
manage and room to re-order I/O--that can work out quite well. For 
example, I've seen a lot of people try to keep latency down by using the 
deadline scheduler and very low settings for the expire times.  Theory 
is great, but it never works out in the real world for me though.  
Here's the sort of deadline I deploy instead now:

     echo 500      > ${DEV}/queue/iosched/read_expire
     echo 300000   > ${DEV}/queue/iosched/write_expire
     echo 1048576  > ${DEV}/queue/iosched/writes_starved

These numbers look insane compared to the defaults, but I assure you 
they're from a server that's happily chugging through 5 to 10K 
transactions/second around the clock.  PostgreSQL forces writes out with 
fsync when they must go out, but this sort of tuning is basically giving 
up on it managing writes beyond that.  We really have no idea what order 
they should go out in.  I just let the kernel have a large pile of work 
queued up, and trust things like the kernel's block elevator and 
congestion code are smarter than the database can possibly be.

-- 
Greg Smith greg(dot)smith(at)crunchydatasolutions(dot)com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/


In response to

Responses

pgsql-hackers by date

Next:From: Tom LaneDate: 2014-01-17 20:42:01
Subject: Re: [PATCH] pgcrypto: implement gen_random_uuid
Previous:From: Alvaro HerreraDate: 2014-01-17 20:19:58
Subject: Re: currawong is not a happy animal

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group