Re: checkpointer continuous flushing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-11-11 07:38:57
Message-ID: 20151111073857.GX32209@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> Here is a v13, which is just a rebase after 1aba62ec.

And here's v14. It's not entirely ready yet. A lot of details have
changed, and I unfortunately don't remember them all. But there are more
important things than the details of the patch.

I've played *a lot* with this patch. I found a bunch of issues:

1) The FileFlushContext infrastructure isn't actually
correct. There are two problems: First, using the raw 'fd' number to
reference a to-be-flushed file isn't meaningful. If there are lots
of files open, fds get reused within fd.c. That part is easily fixed
by referencing the File instead of the fd. The bigger problem is that
the infrastructure doesn't deal with files being closed. There can be
smgr invalidations, which aren't that hard to trigger, that cause the
smgr handle, and thus the file, to be closed.

I think this means that the entire flushing infrastructure actually
needs to be hoisted up, onto the smgr/md level.
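
To make the fd-reuse problem concrete, here is a minimal sketch of what
a queued-writeback entry could look like once hoisted up (the names and
the stand-in typedef are mine, not the patch's actual API):

	#include <sys/types.h>

	typedef int File;	/* stand-in for fd.c's virtual file handle */

	/*
	 * Hypothetical pending-writeback entry: reference the File, not the
	 * raw kernel fd, so that fd reuse inside fd.c can't misdirect the
	 * eventual flush.  An smgr invalidation closing the file still has
	 * to drop or issue entries like this first.
	 */
	typedef struct PendingWriteback
	{
		File	file;		/* virtual handle, stable across fd reuse */
		off_t	offset;		/* start of the dirty range */
		off_t	nbytes;		/* length of the dirty range */
	} PendingWriteback;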

2) I noticed that sync_file_range() blocked far more often than I'd
expected. Reading the kernel code, that turned out to be caused by a
pessimization introduced into the kernel years ago - in many
situations SFR_WRITE waited for the writes to complete. A fix for
this will be in the 4.4 kernel.
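
For context, the call in question is essentially this (a trimmed-down,
compilable illustration, not the patch's code):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/types.h>

	/*
	 * Ask the kernel to start writeback of a dirty range.  With only
	 * SYNC_FILE_RANGE_WRITE set this is documented as non-blocking,
	 * but pre-4.4 kernels could still end up waiting for the writes
	 * here.
	 */
	static int
	hint_writeback(int fd, off_t offset, off_t nbytes)
	{
		return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
	}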

3) I found that latency wasn't improved much for workloads that are
significantly bigger than shared buffers. The problem here is that
neither bgwriter nor the backends have, so far, done
sync_file_range() calls. That meant that the old problem of having
gigabytes of dirty data that periodically get flushed out still
exists. Having those do flushes as well mostly attacks that problem.
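
Roughly the logic I'm thinking of for the writers (a hedged sketch; the
threshold and names are made up, and a real version would track this
per file):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/types.h>

	#define WRITEBACK_THRESHOLD	(256 * 1024)	/* made-up batching limit */

	static off_t pending_start = 0;
	static off_t pending_len = 0;

	/*
	 * Called after a backend/bgwriter write(): accumulate the dirty
	 * range and nudge the kernel once enough has piled up, instead of
	 * letting gigabytes of dirty data sit until the next fsync().
	 */
	static void
	note_write(int fd, off_t offset, off_t nbytes)
	{
		/* flush the queued range if the new write isn't contiguous */
		if (pending_len > 0 && offset != pending_start + pending_len)
		{
			sync_file_range(fd, pending_start, pending_len,
							SYNC_FILE_RANGE_WRITE);
			pending_len = 0;
		}
		if (pending_len == 0)
			pending_start = offset;
		pending_len += nbytes;

		if (pending_len >= WRITEBACK_THRESHOLD)
		{
			sync_file_range(fd, pending_start, pending_len,
							SYNC_FILE_RANGE_WRITE);
			pending_len = 0;
		}
	}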

Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers, flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency - even without
the kernel fix from 2), although applying it improves things further.

What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput - up to
30%. The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch, the write loop in BufferSync() started at the current clock hand,
using StrategySyncStart(). After the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there, the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(
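
To spell that out, the pre-sorting loop was roughly the following
(heavily simplified; NeedsCheckpointWrite() is a made-up stand-in for
the BM_CHECKPOINT_NEEDED check):

	/* old BufferSync() write loop, starting at the clock hand */
	buf_id = StrategySyncStart(NULL, NULL);
	for (i = 0; i < NBuffers; i++)
	{
		/*
		 * Writing just ahead of the clock sweep means likely victim
		 * buffers are already clean when backends/bgwriter reach them.
		 */
		if (NeedsCheckpointWrite(buf_id))
			SyncOneBuffer(buf_id, false);

		if (++buf_id >= NBuffers)
			buf_id = 0;
	}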

My benchmarking suggests that the effect is larger the shorter the
checkpoint timeout is. That intuitively makes sense, given the above
attempt at an explanation: if the checkpoint takes longer, the clock
hand will almost certainly soon overtake the checkpoint's 'implicit'
hand.

I'm not sure we can really do anything about this problem. While I'm
pretty jet-lagged, I still spent a fair amount of time thinking about
it, and it seems to suggest that we need to bring back the setting to
enable/disable sorting :(

What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
handling the issue of smgr invalidations.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list GUC that
   can later contain multiple elements like checkpoint, bgwriter,
   backends, ddl, bulk-writes. That seems better than adding separate
   GUCs for each of these. Then make the flush locations in the patch
   configurable using that (a rough sketch of the parsing is below,
   after this list).
3) I think we should remove the sort timing from the checkpoint logging
before commit. It'll always be pretty short.
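
For 2), I'm thinking of something along these lines (a standalone
sketch; the flag names and the parser are made up for illustration -
the real thing would go through GUC check/assign hooks):

	#include <string.h>

	#define FLUSH_CHECKPOINT	(1 << 0)
	#define FLUSH_BGWRITER		(1 << 1)
	#define FLUSH_BACKENDS		(1 << 2)

	/* parse e.g. "checkpoint, bgwriter" into a bitmask; -1 on error */
	static int
	parse_flush_targets(const char *value)
	{
		int		mask = 0;
		char	buf[256];
		char   *tok, *save;

		strncpy(buf, value, sizeof(buf) - 1);
		buf[sizeof(buf) - 1] = '\0';

		for (tok = strtok_r(buf, ", ", &save); tok != NULL;
			 tok = strtok_r(NULL, ", ", &save))
		{
			if (strcmp(tok, "checkpoint") == 0)
				mask |= FLUSH_CHECKPOINT;
			else if (strcmp(tok, "bgwriter") == 0)
				mask |= FLUSH_BGWRITER;
			else if (strcmp(tok, "backends") == 0)
				mask |= FLUSH_BACKENDS;
			else
				return -1;		/* unknown element */
		}
		return mask;
	}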

Greetings,

Andres Freund

Attachment Content-Type Size
0001-ckpt-14-andres.patch text/x-patch 49.3 KB
