Re: postgresql latency & bgwriter not doing its job

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-28 18:04:06
Message-ID: alpine.DEB.2.10.1408281947350.28571@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


>> I tried that by setting:
>> vm.dirty_expire_centisecs = 100
>> vm.dirty_writeback_centisecs = 100
>>
>> So it should start writing returned buffers at most 2s after they are
>> returned, if I understood the doc correctly, instead of at most 35s.
>>
>> The result is that with a 5000s 25tps pretty small load (the system can do
>> 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
>> that is they were beyond the 200ms schedule limit. Another change is that
>> overall the lost transactions are more spread than without this setting,
>> although there still is stretches of unresponsiveness.
>>
>> So although the situation is significantly better, it is still far from good
>> with the much reduced value I tried.
>
> What do the checkpoint logs look like now? (especially interested in sync times)

Here is an extract:

LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 4893 buffers (4.8%);
0 transaction log file(s) added, 0 removed, 0 recycled;
write=128.430 s, sync=0.378 s, total=128.921 s;
sync files=11, longest=0.375 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6200 buffers (6.1%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=128.934 s, sync=0.240 s, total=129.280 s;
sync files=7, longest=0.132 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6177 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=130.146 s, sync=0.322 s, total=130.592 s;
sync files=7, longest=0.185 s, average=0.046 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6185 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=132.673 s, sync=0.143 s, total=132.909 s;
sync files=5, longest=0.078 s, average=0.028 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6188 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=130.415 s, sync=0.170 s, total=130.676 s;
sync files=5, longest=0.132 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6203 buffers (6.1%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=131.726 s, sync=0.444 s, total=132.256 s;
sync files=5, longest=0.441 s, average=0.088 s
...

Basically sync is between 0.1-0.5 seconds.

I had one particulary bad stretch of unresponsiveness, which is not
clearly related to a slow checkpoint sync:

progress: 4065.0 s, 28.7 tps, lat 11.886 ms stddev 48.470, lag 18.851 ms, 0 skipped
progress: 4066.2 s, 1.6 tps, lat 952.830 ms stddev 280.673, lag 155.310 ms, 25 skipped
progress: 4067.0 s, 2.7 tps, lat 1067.730 ms stddev 316.321, lag 171.766 ms, 14 skipped
progress: 4068.4 s, 1.4 tps, lat 1147.392 ms stddev 240.337, lag 132.297 ms, 35 skipped
progress: 4069.7 s, 2.3 tps, lat 1034.543 ms stddev 213.786, lag 154.453 ms, 35 skipped
progress: 4070.2 s, 1.8 tps, lat 562.715 ms stddev 0.000, lag 182.833 ms, 12 skipped
progress: 4071.3 s, 3.6 tps, lat 600.929 ms stddev 108.232, lag 162.723 ms, 25 skipped
progress: 4072.5 s, 1.7 tps, lat 1077.187 ms stddev 77.654, lag 168.849 ms, 31 skipped
progress: 4073.3 s, 1.3 tps, lat 1298.093 ms stddev 0.000, lag 168.882 ms, 21 skipped
progress: 4074.0 s, 2.7 tps, lat 920.704 ms stddev 183.587, lag 165.333 ms, 24 skipped
progress: 4075.3 s, 2.4 tps, lat 756.051 ms stddev 118.241, lag 176.863 ms, 29 skipped
progress: 4076.1 s, 1.3 tps, lat 1424.548 ms stddev 0.000, lag 0.000 ms, 17 skipped
progress: 4077.3 s, 2.4 tps, lat 791.090 ms stddev 338.729, lag 155.403 ms, 26 skipped
progress: 4078.1 s, 27.4 tps, lat 46.834 ms stddev 198.812, lag 3.915 ms, 0 skipped

--
Fabien

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2014-08-28 18:26:53 Re: Why data of timestamptz does not store value of timezone passed to it?
Previous Message David E. Wheeler 2014-08-28 17:56:21 Re: Missing plpgsql.o Symbols on OS X