Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From: David Gould <daveg(at)sonic(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
Date: 2015-10-31 08:37:14
Message-ID: 20151031013714.7e29af3b@engels
Lists: pgsql-bugs

On Fri, 30 Oct 2015 23:19:52 -0700
David Gould <daveg(at)sonic(dot)net> wrote:

> On Fri, 30 Oct 2015 21:49:00 -0700
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> > The attached patch does that. In a system with 4 CPUs and that had
> > 100,000 tables, with a big chunk of them in need of vacuuming, and
> > with 30 worker processes, this increased the throughput by a factor of
> > 40. Presumably it will do even better with more CPUs.
> >
> > It is still horribly inefficient, but 40 times less so.
>
> That is a good result for such a small change.
>
> The attached patch against REL9_5_STABLE goes a little further. It
> claims the table under the lock, but also addresses the problem of all the
> workers racing to redo the same table by enforcing an ordering on all the
> workers. No worker can claim a table with an oid smaller than the highest
> oid claimed by any worker. That is, instead of racing to the same table,
> workers leapfrog over each other.
>
> In theory the recheck of the stats could be eliminated although this patch
> does not do that. It does eliminate the special handling of stats snapshots
> for autovacuum workers which cuts back on the excess rewriting of the stats
> file somewhat.
>
> I'll send numbers shortly, but as I recall it is over 100 times better than
> the original.

As promised, here are the numbers. The setup is a 2-core Haswell i3 with a
single SSD. The system is fanless, so it slows down after a few minutes of
load. The database has 40,000 tiny, freshly created tables. Autovacuum will
try to analyze them, but that is not much work per table, so the number of
tables analyzed per minute is a pretty good measure of the recheck
overhead and contention among the workers.

Unpatched PostgreSQL 9.5beta1 (I let it run for over an hour, but it did not
get very far):

seconds  elapsed  actions  chunk  sec/av  av/min
  430.1    430.1     1000   1000   0.430   139.5
 1181.2    751.1     2000   1000   0.751    79.9
 1954.0    772.7     3000   1000   0.773    77.6
 2618.5    664.5     4000   1000   0.664    90.3
 3305.7    687.2     5000   1000   0.687    87.3
 4010.1    704.4     6000   1000   0.704    85.2
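
The derived columns appear to follow from the per-chunk elapsed time: sec/av is elapsed divided by chunk, and av/min is 60 times chunk divided by elapsed. A minimal sketch, assuming that reading of the column headers (the function name chunk_rates is hypothetical):

```python
def chunk_rates(elapsed_s, chunk):
    """Return (seconds per autovacuum action, actions per minute)
    for one chunk of `chunk` actions that took `elapsed_s` seconds."""
    sec_per_av = elapsed_s / chunk
    av_per_min = 60.0 * chunk / elapsed_s
    return sec_per_av, av_per_min


# First row of the unpatched run: 1000 actions in 430.1 seconds.
sec_per_av, av_per_min = chunk_rates(430.1, 1000)
print(f"{sec_per_av:.3f} sec/av, {av_per_min:.1f} av/min")
```

The elapsed values in the table are themselves rounded to one decimal place, so recomputing av/min from them can differ from the printed column in the last digit for the shorter chunks.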

A ps sample from partway through the run. Most of the CPU time is used by
the stats collector:
$ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
30212 ? Ss 0:00 postgres: autovacuum launcher process
30213 ? Ds 0:55 postgres: stats collector process
30221 ? Ss 0:23 postgres: autovacuum worker process avac
30231 ? Ss 0:12 postgres: autovacuum worker process avac
30243 ? Ss 0:11 postgres: autovacuum worker process avac
30257 ? Ss 0:10 postgres: autovacuum worker process avac

PostgreSQL 9.5beta1 plus my ordered-OID/high-watermark autovacuum patch:

seconds  elapsed  actions  chunk  sec/av  av/min
   13.4     13.4     1000   1000   0.013  4471.9
   22.9      9.5     2000   1000   0.010  6299.9
   31.9      8.9     3000   1000   0.009  6718.9
   40.2      8.3     4000   1000   0.008  7220.2
   52.2     12.1     5000   1000   0.012  4973.1
   59.5      7.2     6000   1000   0.007  8318.3
   69.4     10.0     7000   1000   0.010  6024.7
   78.9      9.5     8000   1000   0.010  6311.8
   93.5     14.6     9000   1000   0.015  4105.1
  104.3     10.7    10000   1000   0.011  5601.7
  114.4     10.2    11000   1000   0.010  5887.0
  127.5     13.1    12000   1000   0.013  4580.9
  140.1     12.6    13000   1000   0.013  4763.0
  153.8     13.7    14000   1000   0.014  4388.9
  166.7     12.9    15000   1000   0.013  4638.6
  181.6     14.8    16000   1000   0.015  4043.9
  200.9     19.3    17000   1000   0.019  3113.5
  217.5     16.7    18000   1000   0.017  3600.8
  231.5     14.0    19000   1000   0.014  4285.7
  245.5     14.0    20000   1000   0.014  4286.3
  259.0     13.5    21000   1000   0.013  4449.7
  274.5     15.5    22000   1000   0.015  3874.2
  292.5     18.0    23000   1000   0.018  3332.4
  311.3     18.8    24000   1000   0.019  3190.3
  326.1     14.8    25000   1000   0.015  4047.8
  345.1     19.0    26000   1000   0.019  3158.1
  363.5     18.3    27000   1000   0.018  3270.6
  382.4     18.9    28000   1000   0.019  3167.6
  403.4     21.0    29000   1000   0.021  2855.0
  419.6     16.2    30000   1000   0.016  3701.6

A ps sample from partway through the run. Most of the CPU time is used by
the workers, not the collector:
$ ps xww | awk '/collector|autovacuum worker/ && !/awk/'
872 ? Ds 0:49 postgres: stats collector process
882 ? Ds 3:42 postgres: autovacuum worker process avac
953 ? Ds 3:21 postgres: autovacuum worker process avac
1062 ? Ds 2:56 postgres: autovacuum worker process avac
1090 ? Ds 2:34 postgres: autovacuum worker process avac

It seems to slow down a bit after a few minutes. I think this may be
because the OS page cache fills with dirty pages, as the test is fully
I/O-bound for most of its duration. Or possibly CPU throttling. I'll see
about retesting on better hardware.
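
For a rough sense of the overall gain, the final rows of the two tables can be compared directly. This is a back-of-the-envelope sketch using only the cumulative figures reported above; the helper name rate_per_min is hypothetical, and the two runs differ in length, so the ratios are indicative only.

```python
def rate_per_min(actions, seconds):
    """Average autovacuum actions per minute over a whole run."""
    return 60.0 * actions / seconds


# Last row of each table: cumulative actions and cumulative seconds.
unpatched = rate_per_min(6000, 4010.1)
patched = rate_per_min(30000, 419.6)

# Ratio of whole-run averages, and ratio of the time each build
# needed to complete its first 6000 actions.
overall_ratio = patched / unpatched
first_6000_ratio = 4010.1 / 59.5

print(f"unpatched: {unpatched:.1f} av/min, patched: {patched:.1f} av/min")
print(f"overall: {overall_ratio:.1f}x, first 6000 actions: {first_6000_ratio:.1f}x")
```

On these figures the patched build averages roughly 48 times the unpatched rate over the whole run, and roughly 67 times over the first 6,000 actions, before the slowdown sets in.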

-dg

--
David Gould 510 282 0869 daveg(at)sonic(dot)net
If simplicity worked, the world would be overrun with insects.
