Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Gregory Smith <gregsmithpgsql(at)gmail(dot)com>
To: Mel Gorman <mgorman(at)suse(dot)de>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 20:24:01
Message-ID: 52D99161.60305@gmail.com
Lists: pgsql-hackers

On 1/17/14 10:37 AM, Mel Gorman wrote:
> There is not an easy way to tell. To be 100%, it would require an
> instrumentation patch or a systemtap script to detect when a
> particular page is being written back and track the context. There are
> approximations though. Monitor nr_dirty pages over time.

I have a benchmarking wrapper for the pgbench testing program called
pgbench-tools: https://github.com/gregs1104/pgbench-tools As of
October, on Linux it now plots the "Dirty" value from /proc/meminfo over
time. You get that on the same time axis as the transaction latency
data. The report at the end includes things like the maximum amount of
dirty memory observed during the test sampling. That doesn't tell you
exactly what's happening at the level of detail someone reworking the
kernel logic might want, but you can easily see things like the
database's checkpoint
cycle reflected by watching the dirty memory total. This works really
well for monitoring production servers too. I have a lot of data from a
plugin for the Munin monitoring system that plots the same way. Once
you have some history about what's normal, it's easy to see when a
system is falling behind on writes, and the high water mark often
correlates with periods of bad responsiveness.
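
If anyone wants to see that sort of graph without pulling in all of
pgbench-tools, here's a minimal sketch of the same sampling idea (this
is not the actual pgbench-tools code; the log file name and the 5
second interval are arbitrary):

#!/bin/bash
# Sketch only: log the "Dirty" value (in kB) from /proc/meminfo with a
# timestamp every 5 seconds, so it can be plotted against transaction
# latency afterwards.
while true; do
    echo "$(date +%s) $(awk '/^Dirty:/ {print $2}' /proc/meminfo)" >> dirty.log
    sleep 5
done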

Another recent change is that pgbench for the upcoming PostgreSQL 9.4
now allows you to specify a target transaction rate. Seeing the write
latency behavior with that in place is far more interesting than
anything we were able to watch with pgbench before. The pgbench write
tests we've been doing for years mainly told you the throughput rate
when all of the caches were always as full as the database could make
them, and tuning for that is not very useful. It turns out to be far
more interesting to run at 50% of what the storage is capable of, then
watch what happens to latency when you adjust things like the dirty_*
parameters.
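
For reference, a rate-limited run with the 9.4 pgbench looks something
like this; the client count, duration, rate, and database name below
are placeholders, with the rate picked at around half of what the
storage can sustain:

# Illustrative numbers only: 16 clients for 10 minutes, throttled to a
# target of 2500 transactions/second, logging per-transaction latency (-l).
pgbench -c 16 -j 4 -T 600 --rate=2500 -l pgbench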

I've been working on the problem of how we can make a benchmark test
case that acts enough like real, busy PostgreSQL servers that we can
share it with kernel developers, so that everyone has an objective way
to measure changes. These rate-limited tests are working much better
for that than anything I came up with before.

I am skeptical that the database will take over very much of this work
and perform better than the Linux kernel does. My take is that our most
useful role would be providing test cases kernel developers can add to a
performance regression suite. Ugly "we never thought that would
happen" situations seem to be at the root of many of the kernel
performance regressions people here get nailed by.

Effective I/O scheduling is very hard, and we are unlikely to ever
out-innovate the kernel hacking community by pulling more of that into the
database. It's already possible to experiment with moving in that
direction with tuning changes. Use a larger database shared_buffers
value, tweak checkpoints to spread I/O out, and reduce things like
dirty_ratio. I do some of that, but I've learned it's dangerous to
wander too far that way.
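
To make that concrete, the knobs involved are roughly these; the
values shown are only illustrative, not recommendations, and
checkpoint_completion_target is just one of the checkpoint settings
you might tweak:

# postgresql.conf side (illustrative values only):
#   shared_buffers = 8GB                  # larger database-managed cache
#   checkpoint_completion_target = 0.9    # spread checkpoint I/O out
# Kernel side: shrink how much dirty data Linux will accumulate
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=2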

If instead you let Linux do even more work--give it a lot of memory to
manage and room to re-order I/O--that can work out quite well. For
example, I've seen a lot of people try to keep latency down by using the
deadline scheduler and very low settings for the expire times. The
theory is great, but it never works out that way in the real world for
me. Here's the sort of deadline scheduler tuning I deploy instead now:

# ${DEV} is assumed to be the device's sysfs directory, e.g. /sys/block/sda;
# the expire values below are in milliseconds.
echo 500 > ${DEV}/queue/iosched/read_expire
echo 300000 > ${DEV}/queue/iosched/write_expire
echo 1048576 > ${DEV}/queue/iosched/writes_starved

These numbers look insane compared to the defaults, but I assure you
they're from a server that's happily chugging through 5 to 10K
transactions/second around the clock. PostgreSQL forces writes out with
fsync when they must go out, but this sort of tuning is basically giving
up on the database managing writes beyond that. We really have no idea
what order they should go out in. I just let the kernel have a large
pile of work queued up, and trust that things like the kernel's block
elevator and congestion code are smarter than the database can possibly
be.

--
Greg Smith greg(dot)smith(at)crunchydatasolutions(dot)com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
