Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-16 18:17:25
Message-ID: 51E58E35.4070500@2ndQuadrant.com
Lists: pgsql-hackers

On 7/16/13 12:46 PM, Ants Aasma wrote:

> Spread checkpoints sprinkles the writes out over a long
> period and the general tuning advice is to heavily bound the amount of
> memory the OS willing to keep dirty.

That's arguing that you can make this feature useful if you tune in a
particular way. That's interesting, but the goal here isn't to prove
the existence of some workload that a change is useful for. You can
usually find a test case that validates any performance patch as helpful
if you search for one. Everyone who has submitted a sorted checkpoint
patch, for example, has found some setup where it shows significant gains.
We're trying to keep performance stable across a much wider set of
possibilities though.

Let's talk about default parameters instead, which quickly demonstrates
where your assumptions fail. The server I happen to be running pgbench
tests on today has 72GB of RAM running SL6 with RedHat derived kernel
2.6.32-358.11.1. This is a very popular middle-grade server
configuration nowadays. There dirty_background_ratio is 10 (percent),
which means that roughly 7GB of RAM can be used for write caching.
Note that this is a fairly low write
cache tuning compared to a survey of systems in the field--lots of
people have servers with earlier kernels where these numbers can be as
high as 20 or even 40% instead.
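
For anyone who wants to see where their own systems land, a rough Python
sketch like this one reads the vm.dirty_* settings out of /proc and
estimates the write cache ceiling. It's approximate by design: the kernel
applies the ratios to "dirtyable" memory rather than MemTotal, so the real
limits come out somewhat lower, but it's close enough for this kind of
comparison.

#!/usr/bin/env python3
"""Estimate how much RAM Linux will let accumulate as dirty write cache,
based on the vm.dirty_* sysctls.  Back-of-the-envelope only."""


def read_int(path):
    with open(path) as f:
        return int(f.read().strip())


def mem_total_bytes():
    # MemTotal is reported in kB in /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found in /proc/meminfo")


def dirty_limit(ratio_file, bytes_file, total):
    # The *_bytes knob overrides the matching *_ratio knob when nonzero.
    override = read_int(bytes_file)
    if override:
        return override
    return total * read_int(ratio_file) // 100


if __name__ == "__main__":
    total = mem_total_bytes()
    background = dirty_limit("/proc/sys/vm/dirty_background_ratio",
                             "/proc/sys/vm/dirty_background_bytes", total)
    hard = dirty_limit("/proc/sys/vm/dirty_ratio",
                       "/proc/sys/vm/dirty_bytes", total)
    gib = 1024 ** 3
    print(f"RAM total                 : {total / gib:.1f} GiB")
    print(f"background writeback near : {background / gib:.2f} GiB dirty")
    print(f"writers throttled near    : {hard / gib:.2f} GiB dirty")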

The current feasible tuning for shared_buffers suggests a value of 8GB
is near the upper limit, beyond which cache-related overhead makes
further increases counterproductive. Your examples show 53% of
shared_buffers dirty at checkpoint time; that's typical. The
checkpointer is then writing out just over 4GB of data.
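
To make the comparison concrete, here's the back-of-the-envelope math, in
Python just to keep the numbers honest (all values are the ones quoted
above):

# Rough numbers for the scenario described above.
ram_gb = 72
dirty_background_ratio = 0.10          # 10 percent on this SL6 kernel
shared_buffers_gb = 8                  # near the practical upper limit
dirty_fraction_at_checkpoint = 0.53    # from the posted examples

os_write_cache_gb = ram_gb * dirty_background_ratio
checkpoint_write_gb = shared_buffers_gb * dirty_fraction_at_checkpoint

print(f"OS write cache ceiling : {os_write_cache_gb:.1f} GB")    # ~7.2 GB
print(f"checkpoint write volume: {checkpoint_write_gb:.1f} GB")  # ~4.2 GB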

With that background, which process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize. That
potentially includes backend, background writer, checkpoint, temp table,
statistics, log, and WAL data. The scheduler is also considering read
operations.

-The checkpointer process has 4GB of writes from rarely written shared
memory it's trying to optimize.

This is why, if you take the opposite of your approach today--searching
for workloads where sorting is counterproductive--those cases are
equally easy to find. Any test of write speed I do starts with about 50
different scale/client combinations. Why do I suggest pgbench-tools as
a way to do performance tests? Because an automated sweep of client
setups like the one it does is the minimum necessary to create enough
workload variation when changing the database's write path. It's
really amazing how often doing that shows a proposed change is just
shuffling the good and bad cases around. That's been the case for every
sorting and fsync delay change submitted so far. I'm not even
interested in testing today's submission because I tried that particular
approach for a few months, twice so far, and it fell apart on just as
many workloads as it helped.
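
If you want a feel for what that kind of sweep looks like without pulling
in all of pgbench-tools, here's a stripped-down sketch. The grid and run
length are placeholders rather than recommendations, and it assumes pgbench
is on the PATH with a database named "pgbench" already created:

#!/usr/bin/env python3
"""Minimal scale/client sweep in the spirit of pgbench-tools (not its
actual implementation)."""

import subprocess

SCALES = [100, 500, 1000]      # pgbench scale factors to initialize
CLIENTS = [1, 8, 32, 96]       # concurrent client counts to test
RUN_SECONDS = 300              # per-combination run length
DBNAME = "pgbench"

for scale in SCALES:
    # Rebuild the standard pgbench tables at this scale factor.
    subprocess.run(["pgbench", "-i", "-s", str(scale), DBNAME], check=True)
    for clients in CLIENTS:
        result = subprocess.run(
            ["pgbench", "-c", str(clients), "-j", str(min(clients, 8)),
             "-T", str(RUN_SECONDS), DBNAME],
            capture_output=True, text=True, check=True)
        # pgbench prints its tps summary on stdout; keep one per combination.
        print(f"scale={scale} clients={clients}")
        print(result.stdout)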

> The checkpointer has the best long term overview of the situation here, OS
> scheduling only has the short term view of outstanding read and write
> requests.

That's true only if shared_buffers is large compared to the OS write
cache, which was not the case in the example I generated with all of a
minute's work. I regularly see servers where Linux's "Dirty" area
becomes a multiple of the dirty buffers written by a checkpoint; I can
usually make that happen at will with CLUSTER and VACUUM on big tables.
The idea that the checkpointer has a long-term view while the OS has
only a short-term one presumes a setup that I would say is possible but
not common.
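
Watching that happen takes nothing more than polling the Dirty line of
/proc/meminfo while the CLUSTER or VACUUM runs; a trivial poller along
these lines (the interval is an arbitrary choice) shows it clearly:

#!/usr/bin/env python3
"""Watch the kernel's Dirty counter grow during a bulk operation."""

import time


def dirty_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])   # value is reported in kB
    return 0


while True:
    print(f"Dirty: {dirty_kb() / (1024 * 1024):.2f} GiB")
    time.sleep(5)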

> kernel settings: dirty_background_bytes = 32M,
> dirty_bytes = 128M.

You disclaimed this as a best-case scenario. It is a low throughput /
low latency tuning. That's fine, but if Postgres optimizes itself
toward those cases it runs the risk of high throughput servers with
large caches being detuned. I've posted examples before showing very
low write caches like this leading to VACUUM running at 1/2 its normal
speed or worse, as a simple example of where a positive change in one
area can backfire badly on another workload. That particular problem
was so common I updated pgbench-tools recently to track table
maintenance time between tests, because that demonstrated an issue even
when the TPS numbers all looked fine.
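
The measurement itself is trivial--this isn't how pgbench-tools does it,
just an illustration of timing the maintenance step between runs, assuming
psycopg2 and the standard pgbench tables:

#!/usr/bin/env python3
"""Time a maintenance pass between benchmark runs (illustration only)."""

import time

import psycopg2

conn = psycopg2.connect(dbname="pgbench")
conn.autocommit = True   # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    start = time.monotonic()
    cur.execute("VACUUM ANALYZE pgbench_accounts")
    elapsed = time.monotonic() - start
conn.close()

print(f"maintenance (VACUUM ANALYZE pgbench_accounts): {elapsed:.1f} s")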

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
