Re: Redesigning checkpoint_segments

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-07 02:06:12
Message-ID: 51B14014.6080208@2ndQuadrant.com
Lists: pgsql-hackers

On 6/6/13 4:42 AM, Joshua D. Drake wrote:
>
> On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
>>
>> (I'm sure you know this, but:) If you perform a checkpoint as fast and
>> short as possible, the sudden burst of writes and fsyncs will
>> overwhelm the I/O subsystem, and slow down queries. That's what we saw
>> before spread checkpoints: when a checkpoint happens, the response
>> times of queries jumped up.
>
> That isn't quite right. Previously we had lock issues as well and
> checkpoints would take considerable time to complete. What I am talking
> about is that the background writer (and wal writer where applicable)
> have done all the work before a checkpoint is even called.

That is not possible, and if you look deeper at a lot of workloads
you'll eventually see why. I'd recommend grabbing snapshots of
pg_buffercache output from a lot of different types of servers and see
what the usage count distribution looks like. That's what I did in order
to create all of the behaviors the current background writer code caters
to. Attached is a small spreadsheet that shows the main two extremes
here, from one of my old talks. "Effective buffer cache system" is full
of usage count 5 pages, while the "Minimally effective buffer cache" one
is all usage count 1 or 0. We don't have redundant systems here; we
have two that aim at distinctly different workloads. That's one reason
why splitting them apart ended up being necessary to move forward: they
really don't overlap very much on some servers.
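
For anyone who wants to take that kind of snapshot, here is a minimal
sketch in Python. The query uses the usagecount and isdirty columns of
the pg_buffercache view. The summarize() helper and the sample rows are
hypothetical, made up only to show the shape of a "hot" cache; they are
not measurements from any server or from the attached spreadsheet.

```python
# Query to run on each server (needs the pg_buffercache extension installed).
USAGE_QUERY = """
SELECT usagecount, count(*) AS buffers,
       sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
"""

def summarize(rows):
    """rows: list of (usagecount, buffers, dirty) tuples from USAGE_QUERY.

    Returns {usagecount: percent of all buffers}. Unused buffers have a
    NULL usagecount, which shows up here as a None key.
    """
    total = sum(buffers for _, buffers, _ in rows)
    return {uc: round(100.0 * buffers / total, 1) for uc, buffers, _ in rows}

# Hypothetical sample rows, not real measurements.
hot_cache = [(0, 50, 5), (1, 50, 10), (5, 900, 600)]
print(summarize(hot_cache))
```

Comparing the output across several servers is usually enough to tell
which of the two extremes a given workload is closer to.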

Sampling a few servers that way was where the controversial idea of
scanning the whole buffer pool every few minutes even without activity
came from too. I found a bursty real world workload where that was
necessary to keep buffers clean usefully, and that heuristic helped them
a lot. I too would like to revisit the exact logic used, but I could cook
up a test case where it's useful again if people really doubt it has any
value. There's one in the 2007 archives somewhere.

The reason the checkpointer code has to do this work, and it has to
spread the writes out, is that on some systems the hot data set hits a
high usage count. If shared_buffers is 8GB and at any moment 6GB of it
has a usage count of 5, which absolutely happens on many busy servers,
the background writer will do almost nothing useful. It won't and
shouldn't touch buffers unless their usage count is low. Those heavily
referenced blocks will only be written to disk once per checkpoint cycle.
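
A toy model of that split, as a sketch only. The Buffer class, the
split_dirty_work() helper, and the max_usage_for_bgwriter threshold are
all illustrative assumptions, not the server's actual code, and the pool
is scaled down from the 8GB example to keep the numbers small.

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    usage_count: int
    dirty: bool

def split_dirty_work(buffers, max_usage_for_bgwriter=0):
    """Return (bgwriter_eligible, left_for_checkpoint) counts of dirty buffers.

    Assumption: the background writer only writes dirty buffers whose
    usage count is at or below the threshold; everything else that is
    dirty waits for the checkpoint.
    """
    dirty_total = sum(1 for b in buffers if b.dirty)
    eligible = sum(
        1 for b in buffers
        if b.dirty and b.usage_count <= max_usage_for_bgwriter
    )
    return eligible, dirty_total - eligible

# Scaled-down pool: 3/4 of the buffers are hot (usage count 5) and dirty,
# the rest are cold, half of them dirty.
pool = (
    [Buffer(5, True)] * 6144
    + [Buffer(0, True)] * 1024
    + [Buffer(0, False)] * 1024
)
eligible, for_checkpoint = split_dirty_work(pool)
print(eligible, for_checkpoint)
```

In this model the hot three quarters of the pool never become
candidates for the background writer, whatever its settings are.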

Without the spreading, in this example you will drop 6GB into "Dirty
Memory" on a Linux server, call fdatasync, and the server might stop
doing any work at all for *minutes* at a time. The easiest way to see it
happen is to set checkpoint_completion_target to 0, put the filesystem
on ext3, and have a server with lots of RAM. I have a monitoring tool
that graphs Dirty Memory over time because this problem is so nasty even
with the spreading code in place.
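
Some rough arithmetic on what the spreading buys, as a sketch. The
required_write_rate() helper is hypothetical, and the 300 second
checkpoint_timeout and 0.5 checkpoint_completion_target are assumed
values chosen for illustration. The model only looks at the time-based
window; it ignores checkpoints triggered by WAL volume.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def required_write_rate(dirty_bytes, checkpoint_timeout_s, completion_target):
    """Average write rate in bytes/s needed to finish inside the spread window."""
    window = checkpoint_timeout_s * completion_target
    if window <= 0:
        # No spreading at all: every write is issued as fast as possible.
        return float("inf")
    return dirty_bytes / window

rate = required_write_rate(6 * GIB, checkpoint_timeout_s=300, completion_target=0.5)
print(f"{rate / MIB:.2f} MiB/s averaged over the window")
```

With a completion target of 0 the window collapses and the same 6GB
lands in the OS write cache in one burst, which is the stall described
above.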

There is this idea that pops up sometimes that a background writer write
is better than a checkpoint one. This is backwards. A dirty block must
be written at least once per checkpoint. If you only write it once per
checkpoint, inside of the checkpoint process, that is the ideal. It's
what you want for best performance when it's possible.
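
To make the write counting concrete, here is a toy model of a single
hot block over one checkpoint cycle. The count_writes() function and
the event times are made up for illustration; it only assumes that any
write of a dirty block cleans it until the next modification.

```python
def count_writes(dirty_times, bgwriter_times, checkpoint_time):
    """Count physical writes of one block during one checkpoint cycle."""
    events = sorted(
        [(t, "dirty") for t in dirty_times]
        + [(t, "bgwrite") for t in bgwriter_times]
        + [(checkpoint_time, "checkpoint")]
    )
    dirty = False
    writes = 0
    for _, kind in events:
        if kind == "dirty":
            dirty = True
        elif dirty:
            # A background writer pass or the checkpoint flushes the block.
            writes += 1
            dirty = False
    return writes

# The block is modified four times before the checkpoint at t=10.
with_bgwriter = count_writes([1, 3, 5, 7], [2, 4, 6, 8], 10)
checkpoint_only = count_writes([1, 3, 5, 7], [], 10)
print(with_bgwriter, checkpoint_only)
```

Every background writer write of a block that gets dirtied again
before the checkpoint is extra I/O on top of the one write the
checkpoint model needs.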

At the same time, some workloads churn through a lot of low usage count
data, rather than building up a large block of high usage count stuff.
On those your best hope for low latency is to crank up the background
writer and let it try to stay ahead of backends with the writes. The
checkpointer won't have nearly as much work to do in that situation.
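
For a sense of scale when cranking up the background writer, here is a
back-of-the-envelope sketch. bgwriter_lru_maxpages and bgwriter_delay
are the real setting names, with 100 pages and 200ms as their defaults;
the bgwriter_max_rate() helper is hypothetical, and it gives an upper
bound that ignores the time the writes themselves take.

```python
def bgwriter_max_rate(lru_maxpages=100, delay_ms=200, block_size=8192):
    """Upper bound on background writer LRU output in bytes per second."""
    rounds_per_second = 1000.0 / delay_ms
    return lru_maxpages * rounds_per_second * block_size

MIB = 1024 ** 2
print(f"defaults: {bgwriter_max_rate() / MIB:.2f} MiB/s")
print(f"1000 pages every 10ms: {bgwriter_max_rate(1000, 10) / MIB:.2f} MiB/s")
```

If backends dirty low usage count buffers faster than that ceiling,
they end up doing the writes themselves.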

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment Content-Type Size
bgwriter-snapshot.xls application/vnd.ms-excel 19.0 KB
