Re: Increase Vacuum ring buffer.

From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Increase Vacuum ring buffer.
Date: 2017-08-15 15:00:38
Message-ID: 20170815180038.1f1953b2@falcon-work
Lists: pgsql-hackers

On Mon, 31 Jul 2017 20:11:25 +0300
Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru> wrote:

> On 2017-07-27 11:53, Sokolov Yura wrote:
> > On 2017-07-26 20:28, Sokolov Yura wrote:
> >> On 2017-07-26 19:46, Claudio Freire wrote:
> >>> On Wed, Jul 26, 2017 at 1:39 PM, Sokolov Yura
> >>> <funny(dot)falcon(at)postgrespro(dot)ru> wrote:
> >>>> On 2017-07-24 12:41, Sokolov Yura wrote:
> >>>> test_master_1/pretty.log
> >>> ...
> >>>> time activity tps latency stddev min max
> >>>> 11130 av+ch 198 198ms 374ms 7ms 1956ms
> >>>> 11160 av+ch 248 163ms 401ms 7ms 2601ms
> >>>> 11190 av+ch 321 125ms 363ms 7ms 2722ms
> >>>> 11220 av+ch 1155 35ms 123ms 7ms 2668ms
> >>>> 11250 av+ch 1390 29ms 79ms 7ms 1422ms
> >>>
> >>> vs
> >>>
> >>>> test_master_ring16_1/pretty.log
> >>>> time activity tps latency stddev min max
> >>>> 11130 av+ch 26 1575ms 635ms 101ms 2536ms
> >>>> 11160 av+ch 25 1552ms 648ms 58ms 2376ms
> >>>> 11190 av+ch 32 1275ms 726ms 16ms 2493ms
> >>>> 11220 av+ch 23 1584ms 674ms 48ms 2454ms
> >>>> 11250 av+ch 35 1235ms 777ms 22ms 3627ms
> >>>
> >>> That's a very huge change in latency for the worse
> >>>
> >>> Are you sure that's the ring buffer's doing and not some
> >>> methodology snafu?
> >>
> >> Well, I tuned postgresql.conf so that there is no such
> >> catastrophic slowdown on the master branch (with default
> >> settings such slowdowns happen quite frequently).
> >> Setting bgwriter_lru_maxpages = 10 (instead of the default
> >> 200) was one such tuning.
> >>
> >> Probably there is some magic "border" that triggers this
> >> behavior. Tuning postgresql.conf shifted the master branch to
> >> the "good side" of this border, and the faster autovacuum
> >> crossed it back to the "bad side".
> >>
> >> Probably, backend_flush_after = 2MB (instead of the default 0)
> >> is also part of this border. I haven't tried benchmarking
> >> without this option yet.
> >>
> >> Anyway, given that checkpoint and autovacuum interference can
> >> be this noticeable, checkpoints clearly should affect the
> >> autovacuum cost mechanism, imho.
> >>
> >> With regards,
> >
> > I'll run the test twice with the default postgresql.conf
> > (except shared_buffers and maintenance_work_mem) to find out
> > the behavior with default settings.
> >
> I've accidentally lost the results of this run, so I will rerun it.
>
> This is what I remember:
> - even with default settings, autovacuum runs 3 times faster:
> 9000s on master, 3000s with the increased ring buffer.
> So xlog fsync really slows down autovacuum.
> - but concurrent transactions slow down (not as extremely as in
> the previous test, but still significantly).
> I can't draw a pretty table now because I lost the results. I'll
> do it after the re-run completes.
>
> With regards,

Excuse me for the long delay.

I ran the test with the default postgresql.conf.

First: the query was a bit different - instead of updating 5 close
but random points using `aid in (:aid1, :aid2, :aid3, :aid4, :aid5)`,
the condition was `aid between :aid1 and :aid1 + 9`. TPS is much
lower (330tps on master vs 540tps for the previous version), but it
is hard to tell whether that is due to the query difference or the
config change. I'm sorry for this inconvenience :-( I will not
repeat this mistake in the future.
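For clarity, the changed workload can be sketched as a pgbench custom
script. Only the WHERE clause above is taken from this test; the \set
line and the use of the standard pgbench_accounts table are my
assumptions about how such a script would typically look:

```sql
-- Hypothetical reconstruction of the custom pgbench script.
-- Only the WHERE predicate is from the test description; the \set
-- range and pgbench_accounts schema are assumed (standard pgbench).
\set aid1 random(1, 100000 * :scale - 9)
UPDATE pgbench_accounts
   SET abalance = abalance + 1
 WHERE aid BETWEEN :aid1 AND :aid1 + 9;
```

Such a script would be run as `pgbench -f script.sql ...` instead of
the built-in TPC-B-like workload.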

Overview:
master : 339 tps, average autovacuum 6000sec.
ring16 : 275 tps, average autovacuum 1100sec, first 2500sec.
ring16 + `vacuum_cost_page_dirty = 40` :
293 tps, average autovacuum 2100sec, first 4226sec.

Running with the default postgresql.conf doesn't show the
catastrophic tps decline when a checkpoint starts during autovacuum
(seen in previous test runs with `autovacuum_vacuum_cost_delay =
2ms`). The overall average tps over the 8 hour test is also much
closer to "master". Still, the increased ring buffer significantly
improves autovacuum performance, which means fsync consumes a lot of
time, comparable with the autovacuum cost delay itself.
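The cost-delay arithmetic behind that comparison can be sketched
roughly. This is a simplification of PostgreSQL's cost-based vacuum
delay (it ignores page-hit/miss costs and assumes every page visited
is dirtied); the GUC names are real, but the formula and defaults
shown (vacuum_cost_limit = 200, 20ms delay, as in PostgreSQL 10) are
my back-of-the-envelope estimate, not the exact server behavior:

```python
# Rough upper bound on dirty-page throughput under cost-based vacuum
# delay: vacuum sleeps for cost_delay every time the accumulated cost
# reaches cost_limit, so per sleep cycle it can dirty at most
# cost_limit / cost_page_dirty pages (ignoring hit/miss costs).
def max_dirty_pages_per_sec(cost_limit, cost_page_dirty, cost_delay_ms):
    pages_per_cycle = cost_limit / cost_page_dirty
    return pages_per_cycle / (cost_delay_ms / 1000.0)

# vacuum_cost_page_dirty = 20 (default) vs 40 (this test):
print(max_dirty_pages_per_sec(200, 20, 20))  # 500.0 dirty pages/s
print(max_dirty_pages_per_sec(200, 40, 20))  # 250.0 dirty pages/s
```

Doubling vacuum_cost_page_dirty roughly halves the dirty-page rate,
which matches the slower but smoother autovacuum observed below.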

Runs with ring16 have occasional jumps in minimum and average
response latency:
test_master_ring16_2/pretty.log
time activity tps latency stddev min max
8475 av+ch 15 2689ms 659ms 1867ms 3575ms
27420 av 15 2674ms 170ms 2393ms 2926ms
These usually happen close to the end of autovacuum.
What could it be? It is clearly bad behavior that is merely hidden
by the current small ring buffer.

Runs with ring16 + `vacuum_cost_page_dirty = 40` are much more
stable in terms of concurrent transaction performance. Only the
first autovacuum has such a "latency jump"; later ones run smoothly.
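The configuration delta for that run can be summarized as a
postgresql.conf fragment. Only the vacuum_cost_page_dirty setting is
stated in this message; the note about the other non-default settings
reflects the test description above:

```
# "ring16 + cost_page_dirty" run: everything at defaults except
# shared_buffers and maintenance_work_mem (values not given here).
vacuum_cost_page_dirty = 40    # default is 20
```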

So, increasing the ring buffer certainly improves autovacuum
performance. Its negative effects can be compensated for with
configuration. It also exposes some bad behavior in the current
implementation that should be investigated more closely.

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Attachment Content-Type Size
testing6_pretty.tar.gz application/gzip 486.9 KB
