checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: checkpointer continuous flushing
Date: 2015-06-01 11:40:20
Message-ID: alpine.DEB.2.10.1506011320000.28433@sto
Lists: pgsql-hackers


Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's
August 2014 patch for flushing while writing during checkpoints, with some
documentation and configuration warnings added.

For the initial patch, see:

http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:

http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints,
and in general to get better latency-bound performance.

Flushes are issued along with the checkpointer's throttled writes, instead of
waiting for the checkpointer's final "fsync", which induces occasional stalls.
In "pgbench -P 1 ..." progress output, such stalls look like this:

progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043 # ok
progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828 # stalled
progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346 # ...
progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964 # restart
progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326 # ok

I've seen similar behavior on FreeBSD with its native FS, so it is not a
Linux-specific or ext4-specific issue, even if both factors may contribute.

There are two implementations: the first one, based on "sync_file_range", is
Linux-specific, while the other relies on "posix_fadvise". The tests below ran
on Linux. If someone could test the posix_fadvise version on relevant
platforms, that would be great...
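
For illustration, here is a minimal sketch (not the patch code itself) of what
issuing such a flush hint on an already-written range of a relation segment
might look like; the HAVE_* macro names are assumptions for this sketch:

    /*
     * Illustrative sketch only: ask the kernel to start writing back a range
     * that has already been write()n, so that the final fsync() at the end of
     * the checkpoint finds little left to do.
     */
    #define _GNU_SOURCE             /* for sync_file_range() on Linux */
    #include <fcntl.h>

    static void
    hint_flush_range(int fd, off_t offset, off_t nbytes)
    {
    #if defined(HAVE_SYNC_FILE_RANGE)
        /* Linux: initiate asynchronous write-back of the range, do not wait. */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    #elif defined(HAVE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
        /*
         * Portable fallback: advising DONTNEED typically triggers write-back
         * of dirty pages in the range, at the cost of also evicting them from
         * the OS cache.
         */
        (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    #endif
    }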

The Linux-specific "sync_file_range" approach was suggested, among other ideas,
by Theodore Ts'o on Robert Haas' blog in March 2014:

http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html

Two GUC variables control whether the feature is activated for writes of
dirty pages issued by the checkpointer and the bgwriter, respectively. Given
that the settings may improve or degrade performance, having GUCs seems
justified. In particular, the stalling issue disappears with SSDs.
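
For illustration, assuming the GUC names used in the patch, enabling both
flush options in postgresql.conf would look like this:

    # postgresql.conf excerpt (illustrative)
    checkpoint_flush_to_disk = on   # flush hints for checkpointer writes
    bgwriter_flush_to_disk = on     # flush hints for bgwriter writes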

The effect is significant on a series of tests shown below with scale 10
pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw
RAID), with shared_buffers=1GB, checkpoint_completion_target=0.8 and
checkpoint_timeout=30s, unless stated otherwise.

Note: I know that this checkpoint_timeout is too small for a normal
config, but the point is to test how checkpoints behave, so the test
triggers as many checkpoints as possible, hence the minimum timeout
setting. I have also done some tests with larger timeouts.

(1) THROTTLED PGBENCH

The objective of the patch is to reduce the latency of transactions
under a moderate load. This first series of tests focuses on that point with
the help of pgbench -R (target rate) and -L (skip/count late transactions).
The measure counts transactions which were skipped or beyond the expected
latency limit while targeting a given transaction rate.

* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during
100 seconds, and latency limit is 100 ms), over 256 runs, 7 hours per case:

  flush     | percent of skipped
 cp  | bgw  | & out of latency limit transactions
 off | off  |  6.5 %
 off | on   |  6.1 %
 on  | off  |  0.4 %
 on  | on   |  0.4 %

* Same as above (100 tps target) over one run of 4000 seconds with
shared_buffers=256MB and checkpoint_timeout=10min:

  flush     | percent of skipped
 cp  | bgw  | & out of latency limit transactions
 off | off  |  1.3 %
 off | on   |  1.5 %
 on  | off  |  0.6 %
 on  | on   |  0.6 %

* Same as the first one but with "-R 150", i.e. targeting 150 tps, 256 runs:

  flush     | percent of skipped
 cp  | bgw  | & out of latency limit transactions
 off | off  |  8.0 %
 off | on   |  8.0 %
 on  | off  |  0.4 %
 on  | on   |  0.4 %

* Same as above (150 tps target) over one run of 4000 seconds with
shared_buffers=256MB and checkpoint_timeout=10min:

  flush     | percent of skipped
 cp  | bgw  | & out of latency limit transactions
 off | off  |  1.7 %
 off | on   |  1.9 %
 on  | off  |  0.7 %
 on  | on   |  0.6 %

Turning "checkpoint_flush_to_disk = on" reduces significantly the number
of late transactions. These late transactions are not uniformly distributed,
but are rather clustered around times when pg is stalled, i.e. more or less
unresponsive.

bgwriter_flush_to_disk does not seem to have a significant impact on these
tests, maybe because shared_buffers is much larger than the database, so the
bgwriter is seldom active.

(2) FULL SPEED PGBENCH

This is not the target use case, but it seems necessary to assess the
impact of these options on tps figures and their variability.

* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case.

  flush     | performance on ...
 cp  | bgw  | 512 100-second runs | 1s intervals (over 51200 seconds)
 off | off  |    691 +- 36 tps    | 691 +- 236 tps
 off | on   |    677 +- 29 tps    | 677 +- 230 tps
 on  | off  |    655 +- 23 tps    | 655 +- 130 tps
 on  | on   |    657 +- 22 tps    | 657 +- 130 tps

On this first test, setting checkpoint_flush_to_disk reduces the performance by
5%, but the per-second standard deviation is nearly halved, i.e. the
performance is more stable across runs, although lower.
The effect of bgwriter_flush_to_disk is inconclusive.

* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with
checkpoint_timeout=10mn and shared_buffers=256MB (at least 6 checkpoints
during the run, probably more because segments are filled more often than
every 10mn):

  flush     | performance (stddev over per-second tps)
 cp  | bgw  |
 off | off  |  877 +- 179 tps
 off | on   |  880 +- 183 tps
 on  | off  |  896 +- 131 tps
 on  | on   |  888 +- 132 tps

On this second test, setting checkpoint_flush_to_disk seems to slightly
improve performance (maybe 2%?) and significantly reduces variability, so it
looks like a good move.

* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)

  flush     | performance on ...
 cp  | bgw  | 32 100-second runs | 1s intervals (over 3200 seconds)
 off | off  |   1970 +- 60 tps   | 1970 +- 783 tps
 off | on   |   1928 +- 61 tps   | 1928 +- 813 tps
 on  | off  |   1578 +- 45 tps   | 1578 +- 631 tps
 on  | on   |   1594 +- 47 tps   | 1594 +- 618 tps

On this test, both the average and the standard deviation are reduced by
about 20%. This does not look like a win.

CONCLUSION

This approach is simple and significantly improves pg fsync behavior under
moderate load, where the database stays mostly responsive. Under full load,
the situation may be improved or degraded, depending on the case.

OTHER OPTIONS

Another idea suggested by Theodore Ts'o seems impractical: playing with the
Linux io-scheduler priority (ioprio_set) is only relevant with the
"cfq" scheduler on actual hard disks; it does not work with other
schedulers, especially "deadline" which seems more advisable for Pg, nor
with hardware RAID, which is a common setting.
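
For reference, a minimal sketch of what such a priority change would look like
(illustrative only; glibc provides no wrapper, so the raw syscall and the
kernel constants are spelled out here):

    /*
     * Illustrative sketch only: put the calling process in the "idle" I/O
     * scheduling class.  Only honoured by the CFQ scheduler on plain disks,
     * which is why this approach was not pursued.
     */
    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_WHO_PROCESS  1
    #define IOPRIO_CLASS_IDLE   3
    #define IOPRIO_CLASS_SHIFT  13

    static int
    set_idle_io_priority(void)
    {
        int ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;

        /* who = 0 means the calling process */
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio);
    }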

Also, Theodore Ts'o suggested using "sync_file_range" to check whether
the writes have reached the disk, and possibly delaying the actual
fsync/checkpoint conclusion if not... I have not tried that: the
implementation is not as trivial, and I'm not sure what to do when the
completion target is approaching, but it could be an interesting
option to investigate. Preliminary tests with a sleep added between the
writes and the final fsync did not yield very good results.
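
For what it is worth, a minimal sketch of the kind of check this could involve,
assuming the SYNC_FILE_RANGE_WAIT_* flags are used to wait for in-flight
write-back before the final fsync (this is not implemented in the patch):

    /*
     * Illustrative sketch only: wait for write-back already submitted for the
     * whole file (offset 0, nbytes 0 means "to end of file"), then fsync(),
     * which should then have little data left to push out.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void
    wait_for_writeback_then_fsync(int fd)
    {
        (void) sync_file_range(fd, 0, 0,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WAIT_AFTER);

        /* fsync() is still required for durability (metadata, drive cache). */
        (void) fsync(fd);
    }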

I've also played with numerous other options (changing checkpointer
throttling parameters, reducing checkpoint timeout to 1 second, playing
around with various kernel settings), but that did not seem to be very
effective for the problem at hand.

I've attached the test script I used, which can be adapted if someone wants
to collect some performance data. I also have some basic scripts to
extract and compute stats; ask if needed.

--
Fabien.

Attachment Content-Type Size
checkpoint-continuous-flush-1.patch text/x-diff 22.2 KB
cp_test.sh application/x-sh 4.4 KB
