Re: checkpointer continuous flushing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-06-19 15:33:39
Message-ID: 20150619153339.GJ29350@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2015-06-17 08:24:38 +0200, Fabien COELHO wrote:
> Here is version 3, including many performance tests with various settings,
> representing about 100 hours of pgbench run. This patch aims at improving
> checkpoint I/O behavior so that tps throughput is improved, late
> transactions are less frequent, and overall performances are more stable.

First off: This is pretty impressive stuff. Being at pgcon, I don't have
time to look into this in detail, but I do plan to comment more
extensively.

> >- Move fsync as early as possible, suggested by Andres Freund?
> >
> >My opinion is that this should be left out for the nonce.

"for the nonce" - what does that mean?

> I did that.

I'm doubtful that it's a good idea to separate this out, if you did.
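
For reference, here is roughly what I mean by "move fsync as early as
possible" - a minimal, compile-level sketch only, with made-up names
(BufEntry, write_one_buffer(), sync_one_segment()), not the actual
checkpointer code. Once buffers are written in sorted order, each file
segment can be fsynced as soon as its last dirty buffer has been handed to
the kernel, instead of queuing every fsync until the end of the checkpoint:

#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins, not the actual checkpointer data structures. */
typedef struct BufEntry
{
    unsigned int relnode;   /* which relation file */
    unsigned int segno;     /* which segment of that file */
    unsigned int blocknum;  /* block within the segment */
    int          buf_id;    /* shared-buffer slot holding the page */
} BufEntry;

/* Write/sync primitives assumed to exist elsewhere; declared here only to
 * keep the sketch small. */
extern void write_one_buffer(int buf_id);
extern void sync_one_segment(unsigned int relnode, unsigned int segno);

static bool
same_segment(const BufEntry *a, const BufEntry *b)
{
    return a->relnode == b->relnode && a->segno == b->segno;
}

/*
 * Write buffers in sorted order and fsync each segment as soon as its last
 * dirty buffer has been written, instead of queuing every fsync until the
 * end of the checkpoint.
 */
static void
write_sorted_buffers(const BufEntry *entries, size_t nentries)
{
    for (size_t i = 0; i < nentries; i++)
    {
        write_one_buffer(entries[i].buf_id);

        /* Last dirty buffer of this segment: sync it right away. */
        if (i + 1 == nentries || !same_segment(&entries[i], &entries[i + 1]))
            sync_one_segment(entries[i].relnode, entries[i].segno);
    }
}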

> - as version 2: checkpoint buffer sorting based on a 2007 patch by
> Takahiro Itagaki but with a smaller and static buffer allocated once.
> Also, sorting is done by chunks of 131072 pages in the current version,
> with a guc to change this value.

I think it's a really bad idea to do this in chunks. That'll frequently
cause useless, repetitive random IO, often interleaved. That pattern is
horrible for SSDs too. We should always try to sort everything at once,
and only fall back to using less memory if we can't allocate enough.
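
As a sketch of that fallback - with made-up names, not tied to your patch -
try one allocation big enough to sort every dirty buffer, and only shrink
the request when the allocation actually fails, so that sorting in chunks
becomes the exception rather than the rule:

#include <stdlib.h>

/* Minimal sort-key layout, for illustration only. */
typedef struct SortItem
{
    unsigned int relnode;
    unsigned int blocknum;
    int          buf_id;
} SortItem;

/*
 * Try to get room for all nbuffers entries in one allocation; only when
 * that fails, halve the request and retry.  Returns NULL only if even a
 * tiny allocation fails.
 */
static SortItem *
alloc_sort_array(size_t nbuffers, size_t *nalloced)
{
    size_t      n = nbuffers;

    while (n > 0)
    {
        SortItem   *items = malloc(n * sizeof(SortItem));

        if (items != NULL)
        {
            *nalloced = n;  /* n < nbuffers only under memory pressure */
            return items;
        }
        n /= 2;             /* shrink and retry instead of fixed chunks */
    }
    *nalloced = 0;
    return NULL;
}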

> * PERFORMANCE TESTS
>
> Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
> random write activity on one table), checkpoint_completion_target=0.8, with
> different settings on a 16GB 8-core host:
>
> . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
> . small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
> . medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
> . large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s

It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.

> Note: figures marked with a star (*) had various issues during their run, so
> the pgbench progress figures were somewhat off and the standard deviation
> computation is not to be trusted beyond "pretty bad".
>
> Caveat: these are only benchmarks on one host at a particular time and
> location, which may not be reproducible or representative of any other
> load. The good news is that all these tests tell the same story.
>
> - full-speed 1-client
>
>    options    |              tps performance over per-second data
>  flush | sort |     tiny     |    small     |    medium    |    large
>   off  | off  |  687 +- 231  | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
>   off  | on   |  699 +- 223  | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
>   on   | off  |  740 +- 125  | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
>   on   | on   |  722 +- 119  | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8
>
> - full speed 4-clients
>
>    options    |        tps performance over per-second data
>  flush | sort |     tiny     |     small     |    medium
>   off  | off  | 2006 +- 748  | 193 +- 1898 * | 205 +- 2465 *
>   off  | on   | 2086 +- 673  | 819 +- 905 *  | 807 +- 1029 *
>   on   | off  | 2212 +- 451  | 169 +- 1269 * | 160 +- 502 *
>   on   | on   | 2073 +- 437  | 743 +- 413    | 822 +- 467
>
> - 100-tps 1-client max 100-ms latency
>
>    options    | percent of late transactions
>  flush | sort | tiny | small | medium
>   off  | off  | 6.31 | 29.44 |  30.74
>   off  | on   | 6.23 |  8.93 |   7.12
>   on   | off  | 0.44 |  7.01 |   8.14
>   on   | on   | 0.59 |  0.83 |   1.84
>
> - 200-tps 1-client max 100-ms latency
>
>    options    | percent of late transactions
>  flush | sort | tiny  | small | medium
>   off  | off  | 10.00 | 50.61 |  45.51
>   off  | on   |  8.82 | 12.75 |  12.89
>   on   | off  |  0.59 | 40.48 |  42.64
>   on   | on   |  0.53 |  1.76 |   2.59
>
> - 400-tps 1-client (or 4 for medium) max 100-ms latency
>
>    options    | percent of late transactions
>  flush | sort | tiny | small | medium
>   off  | off  | 12.0 | 64.28 |  68.6
>   off  | on   | 11.3 | 22.05 |  22.6
>   on   | off  |  1.1 | 67.93 |  67.9
>   on   | on   |  0.6 |  3.24 |   3.1
>

So you've not run things at more serious concurrency; that'd be
interesting to see.

I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.

It's also curious that sorting is detrimental for full speed 'tiny'.

> * CONCLUSION :
>
> For most of these HDD tests, when both options are activated the tps
> throughput is improved (+3 to +300%), late transactions are reduced (by 91%
> to 97%) and overall the performance is more stable (tps standard deviation
> is typically halved).
>
> The effects of the two options are somewhat orthogonal:
>
> - latency is essentially limited by flushing, although sorting also
> contributes.
>
> - throughput is mostly improved thanks to sorting, with some occasional
> small positive or negative effect from flushing.
>
> Some loads may benefit more with only one of the options activated. In
> particular, flushing may have a small adverse effect on throughput under
> some conditions, although not always.

> With SSDs, both options would probably have limited benefit.

I doubt that. Small random writes have bad consequences for wear
leveling. You might not notice that in a short test - again, I doubt
it - but it'll definitely become visible over time.

Greetings,

Andres Freund
