From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 23:48:55
Message-ID: 6551.1340927335@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> So there are two parts to this problem, each of which needs a different
> solution:

> 1. Databases can inadvertently get to the state where many tables need
> wraparound vacuuming at exactly the same time, especially if they have
> many "cold" data partition tables.

I'm not especially sold on your theory that there's some behavior that
forces such convergence, but it's certainly plausible that there was,
say, a schema alteration applied to all of those partitions at about the
same time. In any case, as Robert has been saying, it seems like it
would be smart to try to get autovacuum to spread out the
anti-wraparound work a bit better when it's faced with a lot of tables
with similar relfrozenxid values.
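
Purely as an illustration of what "spreading out" could look like, and not anything that exists in autovacuum today: the freeze threshold could be given a small, stable per-table jitter, so tables that were last frozen in the same batch come due in different cycles rather than all at once. A rough sketch with made-up names follows; the real logic would of course have to use the proper xid-wraparound comparisons rather than plain subtraction.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/*
 * Illustrative only: decide whether a table is due for anti-wraparound
 * vacuum, perturbing the effective threshold by a per-table jitter
 * (derived from the table's OID) of up to ~10% of freeze_max_age, so
 * that tables with nearly identical relfrozenxid values do not all
 * cross the line in the same cycle.
 */
static bool
wraparound_vacuum_due(TransactionId relfrozenxid, TransactionId next_xid,
                      uint32_t freeze_max_age, uint32_t table_oid)
{
    uint32_t jitter = (table_oid * 2654435761u) % (freeze_max_age / 10 + 1);
    uint32_t xid_age = next_xid - relfrozenxid;  /* ignores wraparound math */

    return xid_age > freeze_max_age - jitter;
}

int
main(void)
{
    /* two partitions with identical relfrozenxid get different thresholds */
    printf("oid 16384 due: %d\n",
           (int) wraparound_vacuum_due(1000000, 188500000, 200000000, 16384));
    printf("oid 16417 due: %d\n",
           (int) wraparound_vacuum_due(1000000, 188500000, 200000000, 16417));
    return 0;
}

The hash constant and the 10% cap above are arbitrary; any stable per-table perturbation would serve the same purpose.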

> 2. When we do hit wraparound thresholds for multiple tables, autovacuum
> has no hesitation about doing autovacuum_max_workers worth of wraparound
> vacuum simultaneously, even when that exceeds the I/O capacity of the
> system.

I continue to maintain that this problem is unrelated to wraparound as
such, and that treating it as if it were is a great way to design a bad
solution.
There are any number of reasons why autovacuum might need to run
max_workers at once. What we need to look at is making sure that they
don't run the system into the ground when that happens.

Since your users weren't complaining about performance with one or two
autovac workers running (were they?), we can assume that the cost-delay
settings were such as to not create a problem in that scenario. So it
seems to me that it's down to autovac_balance_cost(). Either there's
a plain-vanilla bug in there, or seek costs are breaking the assumption
that it's okay to give N workers each 1/Nth of the single-worker I/O
capacity.
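
To put numbers on that assumption: with the stock vacuum_cost_limit = 200
and autovacuum_vacuum_cost_delay = 20ms, a single worker gets a budget of
roughly 200 cost units per 20ms nap. The balancing logic is then supposed
to hand each of, say, four concurrent workers about 50 units per 20ms, so
their combined budget still comes out at the single-worker rate. That
arithmetic only holds if four interleaved vacuum streams, each at a
quarter of the rate, really do cost the machine about the same as one
stream at the full rate, which is exactly where seek costs could bite.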

As far as bugs are concerned, I wonder if the premise of the calculation

* The idea here is that we ration out I/O equally. The amount of I/O
* that a worker can consume is determined by cost_limit/cost_delay, so we
* try to equalize those ratios rather than the raw limit settings.

might be wrong in itself? The ratio idea seems plausible but ...
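
For what it's worth, here is the rationing idea from that comment boiled
down to a toy program. This is a simplified sketch, not the actual
autovac_balance_cost() body, and the struct and function names are made up:

#include <stdio.h>

/*
 * Boiled-down sketch of the rationing scheme: each worker's I/O appetite
 * is taken to be cost_limit/cost_delay, and the per-worker limits are
 * rescaled so the workers' combined appetite matches what a single
 * worker would be allowed.
 */
typedef struct
{
    int     cost_limit_base;    /* configured cost limit for this worker */
    int     cost_delay;         /* configured cost delay, in milliseconds */
    int     cost_limit;         /* limit actually handed to the worker */
} FakeWorker;

static void
balance_cost(FakeWorker *workers, int nworkers,
             int vac_cost_limit, int vac_cost_delay)
{
    double  cost_avail = (double) vac_cost_limit / vac_cost_delay;
    double  cost_total = 0.0;
    int     i;

    /* total I/O rate (limit/delay) the workers would consume unthrottled */
    for (i = 0; i < nworkers; i++)
        cost_total += (double) workers[i].cost_limit_base / workers[i].cost_delay;

    /* rescale each limit so the summed rates add up to cost_avail */
    for (i = 0; i < nworkers; i++)
    {
        int     limit = (int) (cost_avail * workers[i].cost_limit_base / cost_total);

        if (limit > workers[i].cost_limit_base)
            limit = workers[i].cost_limit_base;
        if (limit < 1)
            limit = 1;
        workers[i].cost_limit = limit;
    }
}

int
main(void)
{
    /* three workers, all on the stock 200-unit / 20ms settings */
    FakeWorker  w[3] = {{200, 20, 0}, {200, 20, 0}, {200, 20, 0}};
    int         i;

    balance_cost(w, 3, 200, 20);
    for (i = 0; i < 3; i++)
        printf("worker %d: cost_limit = %d\n", i, w[i].cost_limit);
    return 0;
}

On the stock settings this hands each of three identical workers a limit
of 66, i.e. about a third of the single-worker budget, which is precisely
the assumption in question.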

regards, tom lane
