Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-06-07 14:53:12
Message-ID: alpine.DEB.2.10.1506071638490.11135@sto
Lists: pgsql-hackers


Hello Andres,

> They pretty much can't if you flush things frequently. That's why I
> think this won't be acceptable without the sorting in the checkpointer.

* VERSION 2 "WORK IN PROGRESS".

The implementation is more a proof of concept, posted to gather feedback,
than clean code. What it does:

- as in version 1: a simplified asynchronous flush based on Andres Freund's
  patch, with sync_file_range/posix_fadvise used to hint to the OS that
  the buffer must be sent to disk "now".

- added: checkpoint buffer sorting, based on a 2007 patch by Takahiro
  Itagaki, but with a smaller, static buffer allocated once. Also, in
  the current version sorting is done by chunks.

- also added: sync/advise calls are now merged when possible, so fewer
  calls are issued, especially when buffers are sorted, but also when
  there are few files (both mechanisms are sketched below).
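
For illustration, here is a minimal sketch of the combined mechanism,
with hypothetical names and structures (the actual patch works on
PostgreSQL's buffer headers, not on a FlushEntry array): entries are
sorted by file and block, adjacent blocks of the same file are merged
into one range, and each range gets a single flush-hint call.

  /*
   * Minimal sketch, NOT the patch code: hypothetical FlushEntry type,
   * assuming each dirty buffer is identified by (fd, block number).
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>

  #define BLCKSZ 8192

  typedef struct FlushEntry
  {
      int     fd;        /* file containing the buffer */
      long    blkno;     /* block number within that file */
  } FlushEntry;

  /* sort by file, then block, so same-file writes become contiguous */
  static int
  flush_entry_cmp(const void *a, const void *b)
  {
      const FlushEntry *fa = a, *fb = b;

      if (fa->fd != fb->fd)
          return fa->fd < fb->fd ? -1 : 1;
      if (fa->blkno != fb->blkno)
          return fa->blkno < fb->blkno ? -1 : 1;
      return 0;
  }

  static void
  flush_sorted(FlushEntry *entries, int n)
  {
      qsort(entries, n, sizeof(FlushEntry), flush_entry_cmp);

      for (int i = 0; i < n; )
      {
          int     j = i + 1;

          /* merge adjacent blocks of the same file into one range */
          while (j < n && entries[j].fd == entries[i].fd &&
                 entries[j].blkno == entries[j - 1].blkno + 1)
              j++;

  #ifdef HAVE_SYNC_FILE_RANGE
          /* ask the kernel to start writeback of the whole range now */
          sync_file_range(entries[i].fd,
                          (off_t) entries[i].blkno * BLCKSZ,
                          (off_t) (j - i) * BLCKSZ,
                          SYNC_FILE_RANGE_WRITE);
  #else
          /* weaker fallback where sync_file_range is unavailable */
          posix_fadvise(entries[i].fd,
                        (off_t) entries[i].blkno * BLCKSZ,
                        (off_t) (j - i) * BLCKSZ,
                        POSIX_FADV_DONTNEED);
  #endif
          i = j;
      }
  }

Merging matters because each call has a fixed cost: with sorted buffers,
long runs of consecutive blocks are common, so the number of calls
drops sharply.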

* PERFORMANCE TESTS

Impact on "pgbench -M prepared -N -P 1" at scale 10 (simple-update pgbench,
i.e. a mostly-write activity), with checkpoint_completion_target=0.8
and shared_buffers=1GB.
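
For reference, the runs below correspond to something like the following
setup. The invocation is reconstructed, assuming pgbench's -R (rate) and
-L (latency limit) options and a hypothetical "bench" database; the
actual scripts may differ slightly:

  # postgresql.conf excerpt
  shared_buffers = 1GB
  checkpoint_completion_target = 0.8
  checkpoint_timeout = 30s        # or 10min, depending on the test

  # initialize at scale 10, then run throttled at 100 tps,
  # counting transactions over 100 ms as "late"
  pgbench -i -s 10 bench
  pgbench -M prepared -N -P 1 -R 100 -L 100 -T 6400 bench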

Contrary to v1, I have not tested bgwriter flushing, as its impact in
the first round was close to nought. This does not mean that particular
loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled max 100 ms latency over 6400 seconds
with checkpoint_timeout=30s

flush | sort | late transactions
off | off | 6.0 %
off | on | 6.1 %
on | off | 0.4 %
on | on | 0.4 % (93% improvement)

- 100 tps throttled max 100 ms latency over 4000 seconds
  with checkpoint_timeout=10min

flush | sort | late transactions
off | off | 1.5 %
off | on | 0.6 % (?!)
on | off | 0.8 %
on | on | 0.6 % (60% improvement)

- 150 tps throttled max 100 ms latency over 19600 seconds (5.5 hours)
with checkpoint_timeout=30s

flush | sort | late transactions
off | off | 8.5 %
off | on | 8.1 %
on | off | 0.5 %
on | on | 0.4 % (95% improvement)

- full-speed pgbench over 6400 seconds with checkpoint_timeout=30s

flush | sort | tps (average +- stddev of per-second rates)
off | off | 676 +- 230
off | on | 683 +- 213
on | off | 712 +- 130
on | on | 725 +- 116 (7.2% avg/50% stddev improvements)

- full-speed pgbench over 4000 seconds with checkpoint_timeout=10min

flush | sort | tps (average +- stddev of per-second rates)
off | off | 885 +- 188
off | on | 940 +- 120 (6%/36%!)
on | off | 778 +- 245 (hmmm... not very consistent?)
on | on | 927 +- 108 (4.5% avg/43% stddev improvements)

- full-speed pgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s

flush | sort | tps (average +- stddev of per-second rates)
off | off | 2012 +- 747
off | on | 2086 +- 708
on | off | 2099 +- 459
on | on | 2114 +- 422 (5% avg/44% stddev improvements)

* CONCLUSION

For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).

Overall, and not surprisingly, the effects of the two options are mostly
(with exceptions) orthogonal:
 - latency is improved (60 to 95% reduction) essentially by flushing;
 - throughput is improved (4 to 7% better) thanks to sorting.

In detail, some loads may benefit more with only one of the two options
activated. Also, on SSDs both options would probably bring limited benefit.

Usual caveat: these are only benchmarks, run on one host at a particular
time and location; they may or may not be reproducible, and may not be
representative of any other load. The good news is that all these tests
tell the same story.

* LOOKING FOR THOUGHTS

- The bgwriter flushing option seems ineffective; should it be removed
  from the patch?

- Move fsync as early as possible, suggested by Andres Freund?

In these tests, when the flush option is activated, the fsync duration at
the end of the checkpoint is small: out of more than 5525 checkpoint
fsyncs, 0.5% took over 1 second when flush was on, but that figure rises
to 24% when it is off... This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.

My opinion is that this should be left out for the nonce.

- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that once writes are sorted, they are no longer distributed
randomly over tablespaces, which induces lower performance on such systems.

How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace, then start as many threads as there are tablespaces, each one
doing its own independent throttling for one tablespace? For some obscure
reason there are 2 tablespaces by default (pg_global and pg_default), so
that would mean at least 2 threads.

Alternatively, maybe it can be done from one thread, but it would probably
involve some strange hocus-pocus to switch frequently between tablespaces
(see the sketch below).
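
To make the one-thread idea more concrete, here is a rough sketch
(hypothetical code, not part of the patch): keep one sorted write list
per tablespace, and always advance the list whose progress lags the most,
so writes stay spread over all tablespaces and each list finishes at
about the same time.

  /*
   * Hypothetical sketch: interleave checkpoint writes over tablespaces
   * from a single thread. One sorted list of dirty buffer ids per
   * tablespace, filled while scanning shared_buffers.
   */
  typedef struct TsWriteList
  {
      int     nbuffers;   /* dirty buffers counted during the scan */
      int     next;       /* next entry to write in this list */
      int    *buf_ids;    /* buffer ids, sorted by file/block */
  } TsWriteList;

  static void
  write_balanced(TsWriteList *ts, int nts, int ntotal,
                 void (*write_one_buffer)(int buf_id))
  {
      for (int done = 0; done < ntotal; done++)
      {
          int     best = -1;
          double  best_progress = 2.0;    /* above any real progress */

          /* pick the tablespace whose write list lags the most */
          for (int i = 0; i < nts; i++)
          {
              double  progress;

              if (ts[i].next >= ts[i].nbuffers)
                  continue;               /* this tablespace is done */
              progress = (double) ts[i].next / ts[i].nbuffers;
              if (progress < best_progress)
              {
                  best_progress = progress;
                  best = i;
              }
          }
          /* ntotal == sum of nbuffers, so one list is always pending */
          write_one_buffer(ts[best].buf_ids[ts[best].next++]);
      }
  }

The nice property is that the existing checkpoint throttling could stay
as it is; only the choice of the next buffer to write would change.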

--
Fabien.

Attachment Content-Type Size
checkpoint-continuous-flush-2-WIP.patch text/x-diff 34.1 KB
