Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.

From: David Gould <daveg(at)sonic(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13750: Autovacuum slows down with large numbers of tables. More workers makes it slower.
Date: 2015-10-31 07:01:08
Message-ID: 20151031000108.20fa7635@engels
Lists: pgsql-bugs

On Fri, 30 Oct 2015 22:16:04 -0700
Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Fri, Oct 30, 2015 at 9:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> >> Tom Lane wrote:
> >>> Good point ... shouldn't we have already checked the stats before ever
> >>> deciding to try to claim the table?
> >
> >> The second check is there to allow for some other worker (or manual
> >> vacuum) having vacuumed it after we first checked, but which had
> >> finished before we check the array of current jobs.
> >
> > I wonder whether that check costs more than it saves.
>
> A single autovacuum worker can run for hours or days. I don't think
> we should start vacuuming a TB size table because it needed vacuuming
> days ago, when the initial to-do list was built up, but no longer
> does. So some kind of recheck is needed.
>
> I thought of making the recheck first use whichever snapshot we
> currently have hanging around, and then only if it still needs
> vacuuming force a fresh snapshot and re-re-check. The problem with
> that is that any previous snapshot is aggressively destroyed at the
> end of each vacuum or analyze by the EOXact code. So we don't
> actually have a snapshot hanging around to use until we go to the work
> of re-parsing the database stats file. So you have to take special
> steps to exempt AutoVacuumWorker from EOXact code clearing out the
> stats, and I don't know what the repercussions of that might be.
>
> We could also relax the freshness requirements of even the final
> re-check, perhaps dynamically. No point in re-parsing a 40MB stats
> file to avoid unnecessarily vacuuming a 16KB table. But parsing an
> 8KB stats file to avoid unnecessarily vacuuming a 1TB table is well
> worth it. But that runs into the same problem as above. Once you
> have destroyed your previous stats snapshot, you no longer have the
> ability to accept stale stats.

I've been thinking about this and frankly the whole scheme is a bit
suspect. There are a couple of problems:

1. In an 8TB database with 200,000 tables it is not reasonable to build a
list in advance of all the tables to vacuum. Inevitably some workers
will be working off very stale lists.

2. Autovacuum workers do not need a statistics snapshot. They only need the
current statistics for the table they are considering vacuuming.

I think the following might be too big to back patch or maybe even for
9.5, but I think a better overall scheme would look something like:

- The stats collector accepts a new request message, "give me the stats
for this (dboid, table oid)", and handles it by replying to the sender
(the autovacuum worker) with the current statistics entry for that
table. A rough sketch of such a message appears after this list.

- Instead of the worker building a big list of work up front, it just keeps
track of a shared highwater mark of tables worked on and does an index
scan for the next higher oid in pg_class. This is similar to my patch in
that vacuuming is ordered so that workers never contend, but instead of
working from a list they look up the next table in pg_class each time
using the oid index. For each table that might need vacuuming it would
send the stats collector a request for that table's statistics and use
the reply to determine what action to take. A sketch of this loop also
appears below.
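
To make the first point concrete, here is a rough sketch of what the
request message might look like, imitating the layout of the existing
PgStat_Msg* structs. The type name PgStat_MsgTabstatRequest and its
fields are invented for illustration; nothing like this exists in the
current sources:

    /* Hypothetical message: "send me the current stats for one table".
     * Layout imitates the existing PgStat_Msg* structs; the type name
     * and fields are illustrative only. */
    typedef struct PgStat_MsgTabstatRequest
    {
        PgStat_MsgHdr m_hdr;        /* standard pgstat message header */
        Oid           m_databaseid; /* database containing the table */
        Oid           m_tableoid;   /* table whose stats we want */
    } PgStat_MsgTabstatRequest;

    /* The collector would reply directly to the requesting worker with
     * the current PgStat_StatTabEntry for (m_databaseid, m_tableoid),
     * instead of rewriting the whole stats file. */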

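And the worker loop for the second point, in rough pseudo-C. Every
function name here is a placeholder for code that would have to be
written, and locking around the shared highwater mark is omitted:

    Oid relid;

    for (;;)
    {
        /* Index scan on pg_class_oid_index for the next oid above the
         * shared highwater mark, then advance the mark so no other
         * worker claims the same table. */
        relid = next_pg_class_oid_after(shared->highwater_oid);
        if (!OidIsValid(relid))
            break;              /* no more tables; worker exits */
        shared->highwater_oid = relid;

        /* Ask the collector for this one table's current stats entry. */
        PgStat_StatTabEntry *tabentry =
            request_table_stats(MyDatabaseId, relid);

        /* Vacuum/analyze only if the fresh stats say it is needed. */
        if (tabentry != NULL && needs_vacanalyze_now(relid, tabentry))
            autovacuum_one_table(relid);
    }
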
This avoids the stale work list and the need for rechecking. Also, it
eliminates all the flogging of the stats files (I suspect SSD owners
and laptop users will rejoice!).

-dg

--
David Gould 510 282 0869 daveg(at)sonic(dot)net
If simplicity worked, the world would be overrun with insects.
