Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: 周正中(德歌) <dege(dot)zzz(at)alibaba-inc(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, 张广舟(明虚) <guangzhou(dot)zgz(at)alibaba-inc(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, 范孝剑(康贤) <funnyxj(dot)fxj(at)alibaba-inc(dot)com>, 曾文旌(义从) <wenjing(dot)zwj(at)alibaba-inc(dot)com>, 窦贤明(执白) <xianming(dot)dxm(at)alibaba-inc(dot)com>, 萧少聪(铁庵) <shaocong(dot)xsc(at)alibaba-inc(dot)com>, 陈新坚(惧留孙) <xinjian(dot)chen(at)alibaba-inc(dot)com>
Subject: Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write
Date: 2015-09-09 14:27:06
Message-ID: CA+TgmoZYEe3apby_nYMe1eTBY12VVud7ZA-ApJodksUpKeZ5Kw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 8, 2015 at 2:28 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
>> postgres(at)digoal-> cat 7.sql
>> select txid_current();
>>
>> postgres(at)digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
>> About 32K tps.
>> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
>> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
>
> So you're benchmarking how fast you can assign txids. Is that actually
> something meaningful? If you have other writes interleaved you'll write
> much more WAL, so there'll be checkpoints and such.
>
> FWIW, if you measure something realistic and there's checkpoints,
> there'll be fewer fsyncs if you increase the slru buffer size - as
> there'll often be clean buffers due to the checkpoint having written
> them out.

But I think it's not very hard to come up with a workload where 32
clog buffers isn't enough, and you end up waiting for backends to
fsync(). Suppose you have a pgbench workload with N tuples. We
update tuples at random, so sometimes we hit one that's just recently
been updated, and other times we hit one that hasn't been updated for
many transactions. At 32k transactions per page, each transaction's
chance of updating a tuple whose existing xmin is on the most recent
clog page is 32k/N. The chance of hitting a tuple whose existing xmin
is on the next page back is (1 - 32k/N) * 32k/N. The chance of
hitting a page whose current xmin is at least X pages prior to the
most recent one is (1 - 32k/N)^X. So, how many tuples do we need in
order for each update to have a 50% chance of hitting a tuple that is
at least 32 pages back?

(1 - 32k/N)^32 = .5
1 - 32k/N = 0.9785720620877
32k/N = 0.0214279379122999
N = 32k/0.0214279379122999
N = 1529218.54329206

...or in other words, scale factor 16. At scale factors >= 1044, the
chance that the next update hits an xmin more than 32 clog buffers
back is > 99%. So any test of this sort causes extremely rapid clog
page eviction - basically every transaction is going to request a
buffer that's probably not cached, and as soon as it's done with it,
some other transaction will evict it to bring in a different buffer
that's not cached.

How often such a workload actually has to replace a *dirty* clog
buffer obviously depends on how often you checkpoint, but if you're
getting ~28k TPS you can completely fill 32 clog buffers (1 million
transactions) in less than 40 seconds, and you're probably not
checkpointing nearly that often.

I'm not entirely sure how much fsync absorption for SLRU buffers will
help, but I think it would be worth trying. Back when I was spending
more time on this area, I saw some evidence that those fsyncs did
cause at least some latency spikes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-09-09 14:27:50 Re: Counting lines correctly in psql help displays
Previous Message Pavel Stehule 2015-09-09 14:10:02 Re: [patch] Proposal for \rotate in psql