Re: checkpointer continuous flushing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2016-01-11 13:45:16
Message-ID: 20160111134516.imdpaeynxpfggdvx@alap3.anarazel.de

On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
>
> Hello Andres,
>
> >Hm. New theory: The current flush interface does the flushing inside
> >FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> >problem with that is that at that point we (need to) hold a content lock
> >on the buffer!
>
> You are worried that FlushBuffer holds a content lock on the buffer at
> the moment the "sync_file_range" call is issued.
>
> Although I agree that it is not ideal, I would be surprised if that were
> the explanation for a performance regression, because sync_file_range
> with the chosen parameters is an async call: it "advises" the OS to write
> out the file, but it does not wait for that to complete.

I frequently see sync_file_range blocking - it waits until it can
submit the writes into the I/O queues. On a system bottlenecked on I/O
that's not always possible immediately.
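
For illustration, here is a minimal standalone sketch (a toy example of
mine, not PostgreSQL code) of the kind of writeback hint in question.
With only SYNC_FILE_RANGE_WRITE the kernel is asked to start writeback
for the given range and the call does not wait for completion - but it
can still block while queueing the request if the device's I/O queue is
congested:

/* Linux-only toy example: issue an asynchronous writeback hint. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char	buf[8192] = {0};
	int		fd = open("/tmp/flushtest", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
		return 1;

	/* Start writeback of the first 8kB; no WAIT_BEFORE/WAIT_AFTER flags. */
	if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
		perror("sync_file_range");

	close(fd);
	return 0;
}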

> Also, maybe you could answer a question I had about the performance
> regression you observed. I could not find the post where you gave the
> detailed information about it, so that I could try to reproduce it: what
> are the exact settings and conditions (shared_buffers, pgbench scaling,
> host memory, ...), what is the observed regression (tps? something
> else?), and what is the responsiveness of the database under the
> regression (e.g. % of seconds with 0 tps, or something like that)?

I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time)
and running
pgbench -M prepared -c 16 -j16 -T 300 -P 1
I get:

My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 855156
latency average: 5.612 ms
latency stddev: 7.896 ms
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)

My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
master:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2104781
latency average: 2.280 ms
latency stddev: 9.868 ms
tps = 7010.397938 (including connections establishing)
tps = 7010.475848 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1930716
latency average: 2.484 ms
latency stddev: 7.303 ms
tps = 6434.785605 (including connections establishing)
tps = 6435.177773 (excluding connections establishing)

In neither case are there periods of 0 tps, but both have stretches of
< 1000 tps with noticeably increased latency.

The end results are similar with a sane checkpoint_timeout - the tests
just take much longer to give meaningful results. Constantly running
long tests on prosumer-level SSDs isn't nice - I've now killed 5 SSDs
with postgres testing...

As you can see there's roughly a 30% performance regression on the
slower SSD (~2850 vs. ~3851 tps) and ~9% on the faster one (~6435
vs. ~7010 tps). HDD results are similar (but I can't repeat those on
the laptop right now, since the 2nd HDD is now an SSD).

My working copy of checkpoint sorting & flushing currently results in:
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1136260
latency average: 4.223 ms
latency stddev: 8.298 ms
tps = 3786.696499 (including connections establishing)
tps = 3786.778875 (excluding connections establishing)

My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2050661
latency average: 2.339 ms
latency stddev: 7.708 ms
tps = 6833.593170 (including connections establishing)
tps = 6833.680391 (excluding connections establishing)

My version of the patch currently addresses several points, which need
to be separated out and benchmarked separately:
* A different approach to the background writer, trying to make backends
  write less. While that proves beneficial in isolation, on its own it
  doesn't address the performance regression.
* A different flushing API, with the writeback hint issued outside the
  content lock (rough sketch of the idea below).

So this partially addresses the performance problems, but not yet
completely.
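
To illustrate the second point, here's a rough, hypothetical sketch of
what "flushing outside the lock" means - the names below are made up
for illustration and are not the actual patch API. While the content
lock is held we only record which file range was written; the
(potentially blocking) sync_file_range() call happens later, once the
lock has been released:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

typedef struct PendingFlush
{
	int		fd;			/* file the buffer was written to */
	off_t	offset;		/* start of the dirty range */
	off_t	nbytes;		/* length of the dirty range */
} PendingFlush;

#define MAX_PENDING 32

static PendingFlush pending[MAX_PENDING];
static int	npending = 0;

/* Called while the buffer content lock is still held: cheap, no syscall. */
static void
ScheduleFlush(int fd, off_t offset, off_t nbytes)
{
	if (npending < MAX_PENDING)
	{
		pending[npending].fd = fd;
		pending[npending].offset = offset;
		pending[npending].nbytes = nbytes;
		npending++;
	}
}

/* Called after the content lock is released: may block on a busy device. */
static void
IssueScheduledFlushes(void)
{
	for (int i = 0; i < npending; i++)
		(void) sync_file_range(pending[i].fd, pending[i].offset,
							   pending[i].nbytes, SYNC_FILE_RANGE_WRITE);
	npending = 0;
}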

Greetings,

Andres Freund
