Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-06-17 06:24:38
Message-ID: alpine.DEB.2.10.1506170803210.9794@sto
Lists: pgsql-hackers

Hello,

Here is version 3, including many performance tests with various settings,
representing about 100 hours of pgbench runs. This patch aims at improving
checkpoint I/O behavior so that tps throughput is improved, late
transactions are less frequent, and overall performance is more stable.

* SOLILOQUIZING

> - The bgwriter flushing option seems ineffective, it could be removed
> from the patch?

I did that.

> - Move fsync as early as possible, suggested by Andres Freund?
>
> My opinion is that this should be left out for the nonce.

I did that.

> - Take into account tablespaces, as pointed out by Andres Freund?
>
> Alternatively, maybe it can be done from one thread, but it would probably
> involve some strange hocus-pocus to switch frequently between tablespaces.

I took the hocus-pocus approach, including a quasi-proof (not sure what
this mathematical object is :-) in comments to show how/why it works.

* PATCH CONTENTS

- as version 1: simplified asynchronous flush based on Andres Freund's
patch, with sync_file_range/posix_fadvise used to hint the OS that
the buffer must be sent to disk "now".

- as version 2: checkpoint buffer sorting based on a 2007 patch by
Takahiro Itagaki, but with a smaller, static buffer allocated once.
Also, sorting is done in chunks of 131072 pages in the current version,
with a GUC to change this value.

- as version 2: sync/advise calls are now merged if possible,
so fewer calls are issued, especially when buffers are sorted,
but also if there are few files written.

- new: the checkpointer balances its page writes per tablespace.
This is done by choosing to write pages for a tablespace whose
progress ratio (written/to_write) does not exceed the overall progress
ratio for all tablespaces, and by doing that in a round-robin manner
so that all tablespaces regularly get some attention. No threads.

- new: some more documentation is added.

- removed: "bgwriter_flush_to_write" is removed, as there was no clear
benefit on the (simple) tests. It could be considered for another patch.

- question: I'm not sure I understand the checkpointer memory management.
There is some exception handling in the checkpointer main loop. I wonder
whether the allocated memory would be lost in such an event and should
be reallocated. The patch currently assumes that the memory is kept.
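
To make the merging and balancing items above more concrete, here is a
minimal Python sketch. It is not taken from the patch (which is C code
inside the checkpointer): the names merge_flush_ranges, Tablespace,
next_tablespace and checkpoint_order are hypothetical, and the selection
rule assumes "progress ratio does not exceed the overall ratio" is the
intended test. Such a tablespace always exists while work remains, because
the overall ratio is a weighted average of the per-tablespace ratios.

```python
from dataclasses import dataclass


def merge_flush_ranges(blocks):
    """Merge (file, block) pairs into (file, first_block, count) ranges.

    Sorting first makes neighbouring blocks adjacent, so one flush hint
    can cover a whole contiguous run instead of one call per page.
    """
    ranges = []
    for file, block in sorted(set(blocks)):
        if ranges and ranges[-1][0] == file and ranges[-1][1] + ranges[-1][2] == block:
            last = ranges[-1]
            ranges[-1] = (file, last[1], last[2] + 1)  # extend the current run
        else:
            ranges.append((file, block, 1))            # start a new run
    return ranges


@dataclass
class Tablespace:           # hypothetical bookkeeping, not the patch's struct
    name: str
    to_write: int           # pages the checkpoint must write here
    written: int = 0        # pages already written


def next_tablespace(spaces, start):
    """Scan round-robin from `start`; return the index of the first
    unfinished tablespace that is not ahead of overall progress,
    or None once everything has been written."""
    total = sum(s.to_write for s in spaces)
    done = sum(s.written for s in spaces)
    n = len(spaces)
    for step in range(n):
        i = (start + step) % n
        s = spaces[i]
        # written/to_write <= done/total, compared without division
        if s.written < s.to_write and s.written * total <= done * s.to_write:
            return i
    return None


def checkpoint_order(spaces):
    """Return tablespace names in the order their pages would be written."""
    order, cursor = [], 0
    while (i := next_tablespace(spaces, cursor)) is not None:
        spaces[i].written += 1
        order.append(spaces[i].name)
        cursor = (i + 1) % len(spaces)
    return order
```

For instance, checkpoint_order([Tablespace("a", 4), Tablespace("b", 2)])
interleaves writes to both tablespaces rather than finishing "a" before
starting "b", and merge_flush_ranges collapses blocks 1, 2, 3 of one file
into a single range.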

* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:

. tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
. small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
. medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
. large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s

Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect, thus the standard
deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible or representative
of any other load. The good news is that all these tests tell
the same thing.
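
For readers who want to recompute figures of the kind shown in the tables
below from their own runs, here is a minimal Python sketch. It assumes the
per-second tps values and the per-transaction latencies have already been
extracted into lists; the function names are hypothetical, the use of the
population standard deviation is an assumption (the exact formula used for
the tables is not stated), and skipped transactions are not modelled.

```python
from statistics import mean, pstdev


def tps_summary(per_second_tps):
    """Return (average, standard deviation) over per-second tps samples."""
    if not per_second_tps:
        raise ValueError("no samples")
    return mean(per_second_tps), pstdev(per_second_tps)


def late_percent(latencies_ms, limit_ms=100.0):
    """Percentage of transactions whose latency exceeds limit_ms."""
    if not latencies_ms:
        raise ValueError("no transactions")
    late = sum(1 for latency in latencies_ms if latency > limit_ms)
    return 100.0 * late / len(latencies_ms)
```

A run that stalls for whole seconds during a checkpoint shows up as zeros
in the per-second samples, which is what inflates the deviation in the
unsorted, unflushed rows.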

- full-speed 1-client

   options    | tps performance over per-second data
 flush | sort | tiny        | small         | medium        | large
 off   | off  | 687 +- 231  | 163 +- 280 *  | 191 +- 626 *  | 37.7 +- 25.6
 off   | on   | 699 +- 223  | 457 +- 315    | 479 +- 319    | 48.4 +- 28.8
 on    | off  | 740 +- 125  | 143 +- 387 *  | 179 +- 501 *  | 37.3 +- 13.3
 on    | on   | 722 +- 119  | 550 +- 140    | 549 +- 180    | 47.2 +- 16.8

- full-speed 4-client

   options    | tps performance over per-second data
 flush | sort | tiny        | small         | medium
 off   | off  | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
 off   | on   | 2086 +- 673 | 819 +- 905 *  | 807 +- 1029 *
 on    | off  | 2212 +- 451 | 169 +- 1269 * | 160 +- 502 *
 on    | on   | 2073 +- 437 | 743 +- 413    | 822 +- 467

- 100-tps 1-client max 100-ms latency

   options    | percent of late transactions
 flush | sort | tiny  | small | medium
 off   | off  |  6.31 | 29.44 | 30.74
 off   | on   |  6.23 |  8.93 |  7.12
 on    | off  |  0.44 |  7.01 |  8.14
 on    | on   |  0.59 |  0.83 |  1.84

- 200-tps 1-client max 100-ms latency

   options    | percent of late transactions
 flush | sort | tiny  | small | medium
 off   | off  | 10.00 | 50.61 | 45.51
 off   | on   |  8.82 | 12.75 | 12.89
 on    | off  |  0.59 | 40.48 | 42.64
 on    | on   |  0.53 |  1.76 |  2.59

- 400-tps 1-client (or 4 for medium) max 100-ms latency

   options    | percent of late transactions
 flush | sort | tiny  | small | medium
 off   | off  |  12.0 | 64.28 |  68.6
 off   | on   |  11.3 | 22.05 |  22.6
 on    | off  |   1.1 | 67.93 |  67.9
 on    | on   |   0.6 |  3.24 |   3.1

* CONCLUSION

For most of these HDD tests, when both options are activated the tps
throughput is improved (+3 to +300%), late transactions are reduced (by
91% to 97%) and overall the performance is more stable (tps standard
deviation is typically halved).
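
The ranges quoted above can be rechecked from the tables with a few lines
of Python. The numbers are copied from the "off/off" and "on/on" rows of
the tables in this message; the variable and function names are only
illustrative.

```python
# (flush off + sort off, flush on + sort on), copied from the tables above.
late_pct_table = {
    ("100-tps", "tiny"): (6.31, 0.59),
    ("100-tps", "small"): (29.44, 0.83),
    ("100-tps", "medium"): (30.74, 1.84),
    ("200-tps", "tiny"): (10.00, 0.53),
    ("200-tps", "small"): (50.61, 1.76),
    ("200-tps", "medium"): (45.51, 2.59),
    ("400-tps", "tiny"): (12.0, 0.6),
    ("400-tps", "small"): (64.28, 3.24),
    ("400-tps", "medium"): (68.6, 3.1),
}
tps_table = {
    ("1-client", "tiny"): (687, 722),
    ("1-client", "small"): (163, 550),
    ("1-client", "medium"): (191, 549),
    ("1-client", "large"): (37.7, 47.2),
    ("4-client", "tiny"): (2006, 2073),
    ("4-client", "small"): (193, 743),
    ("4-client", "medium"): (205, 822),
}


def reduction(before, after):
    """Relative decrease, in percent."""
    return 100.0 * (before - after) / before


def gain(before, after):
    """Relative increase, in percent."""
    return 100.0 * (after - before) / before


late_reductions = [reduction(b, a) for b, a in late_pct_table.values()]
tps_gains = [gain(b, a) for b, a in tps_table.values()]

print(f"late transactions reduced by {min(late_reductions):.0f}% "
      f"to {max(late_reductions):.0f}%")
print(f"tps improved by {min(tps_gains):.0f}% to {max(tps_gains):.0f}%")
```

The smallest gains come from the tiny setting and the largest from the
small and medium settings with 4 clients.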

The effects of the two options are somewhat orthogonal:

- latency is essentially reduced by flushing, although sorting also
contributes.

- throughput is mostly improved thanks to sorting, with some occasional
small positive or negative effect from flushing.

In detail, some loads may benefit more from only one option activated. In
particular, flushing may have a small adverse effect on throughput in some
conditions, although not always. With SSDs, both options would probably
have limited benefit.

--
Fabien.

Attachment Content-Type Size
checkpoint-continuous-flush-3.patch text/x-diff 40.9 KB
