autovacuum prioritization

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: autovacuum prioritization
Date: 2022-01-20 19:23:45
Message-ID: CA+TgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM++JPwK+Q@mail.gmail.com

autovacuum, as it exists today, has some reasonable - though imperfect
- heuristics for deciding which tables need to be vacuumed or
analyzed. However, it has nothing to help it decide which tables need
to be vacuumed or analyzed first. As a result, when it's configured
aggressively enough, and has sufficient hardware resources, to do all
the things that need to get done, the system holds together reasonably
well. When it's not configured sufficiently aggressively, or when
hardware resources are lacking, the situation deteriorates. Some
deterioration is inevitable in that kind of scenario, because if you
can't do all of the things that need to be done, something will have
to get postponed, and that can cause problems. However, autovacuum
makes it worse than it needs to be by deciding which things to do
first more or less at random. When autovacuum chooses to vacuum a
table that is a little bit bloated in preference to one that is
severely bloated, or one that is bloated in place of one that is about
to cause a wraparound shutdown, users are understandably upset, and
some of them post on this mailing list about it. I am not going to try
to review in detail the history of previous threads looking for action
in this area, but there are quite a few.

In my view, previous efforts in this area have been too simplistic.
For example, it's been proposed that a table that is perceived to be
in any kind of wraparound danger ought to get top priority, but I find
that implausible. A mostly-quiescent table that is one XID past
autovacuum_freeze_max_age is not likely to be a more urgent problem
than a table that is absorbing 20,000 row updates per second. There is
a long way between the default value of autovacuum_freeze_max_age and
a wraparound shutdown, but if you don't vacuum a hotly-updated table
promptly, you will get irreversible bloat and force an application
shutdown to run VACUUM FULL. It's also been proposed that we provide
some way to let the user fix priorities, which was rightly criticized
as foisting on the user what the system ought to be figuring out for
itself. Perhaps a user-controllable prioritization system is a good
idea, but only as a way of overriding some built-in system in cases
where that goes wrong. A few people have proposed scoring systems,
which I think is closer to the right idea, because our basic goal is
to start vacuuming any given table soon enough that we finish
vacuuming it before some catastrophe strikes. The more imminent the
catastrophe, the more urgent it is to start vacuuming right away.
Also, and I think this is very important, the longer vacuum is going
to take, the more urgent it is to start vacuuming right away. If table
A will cause wraparound in 2 hours and take 2 hours to vacuum, and
table B will cause wraparound in 1 hour and take 10 minutes to vacuum,
table A is more urgent even though the catastrophe is further out.

So at a high level, I think that what we ought to do is, first, for
each table, estimate the time at which we think something bad will
occur. There are several bad events: too much bloat, XID wraparound,
MXID wraparound. We need to estimate the time at which we think each
of those things will occur, and then take the earliest of those
estimates. That's the time by which we need to have finished vacuuming
the table. Then, make an estimate of how long it will take to complete
a vacuum of the table, subtract that from the time at which we need to
be done, and that's the time by which we need to start. The earliest
need-to-start time is the highest priority table.
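
Just to make that concrete, here is the shape of the calculation I
have in mind, as a small Python sketch (nothing here exists today;
the per-table numbers are stand-ins for estimates we'd have to be
able to make):

    def need_to_start(table):
        # The need-to-finish time is the earliest of the bad events:
        # too much bloat, XID wraparound, MXID wraparound.
        need_to_finish = min(table["bloat_deadline"],
                             table["xid_deadline"],
                             table["mxid_deadline"])
        # Back that off by how long we expect the vacuum itself to take.
        return need_to_finish - table["vacuum_duration"]

    # Highest priority = earliest need-to-start time.
    def pick_next_table(tables):
        return min(tables, key=need_to_start)

    # Times in hours from now.  Table A wraps around in 2 hours and takes
    # 2 hours to vacuum (need-to-start = now); table B wraps around in 1
    # hour and takes 10 minutes (need-to-start = now + 50 minutes), so A
    # is more urgent even though its deadline is later.
    table_a = {"bloat_deadline": 99, "xid_deadline": 2,
               "mxid_deadline": 99, "vacuum_duration": 2}
    table_b = {"bloat_deadline": 99, "xid_deadline": 1,
               "mxid_deadline": 99, "vacuum_duration": 10 / 60}
    assert pick_next_table([table_a, table_b]) is table_a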

There are a number of problems here. One is that we actually need to
be able to estimate all the things that I just described, which will
probably require tracking statistics that we don't capture today, such
as the rate at which the system is consuming XIDs, and the rate at
which a table is bloating, rather than just the current state of
things. Unfortunately, proposing that the statistics system should
store more per-table information is a well-known way to get your patch
forcefully rejected. Maybe if we were storing and updating the
statistics data in a better way it wouldn't be an issue - so perhaps
shared memory stats collector stuff will resolve this issue somehow.
Or maybe it's not an issue anyway, since the big problem with the
statistics files is that they have to be constantly rewritten. If we
took snapshots of the values in even a relatively dumb way, they'd be
kinda big, but they'd also be write-once. Maybe that would keep the
expense reasonable.
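
The derived numbers we'd actually need from such snapshots are not
complicated, either; every rate in question reduces to a delta over
elapsed time, roughly like this (the snapshot format is invented,
purely for illustration):

    # Given two dumb, write-once snapshots of a counter - XIDs consumed
    # system-wide, or n_dead_tup for one table - the rate is just the
    # delta divided by the elapsed time.  The snapshot format here is
    # invented; it's only meant to show how little data is required.
    def consumption_rate(older, newer):
        elapsed = newer["taken_at"] - older["taken_at"]     # seconds
        return (newer["value"] - older["value"]) / elapsed  # units/second

    # e.g. two snapshots of a table's dead-tuple count taken a day apart:
    old = {"taken_at": 0,     "value": 1_000_000}
    new = {"taken_at": 86400, "value": 3_000_000}
    print(consumption_rate(old, new))   # ~23 dead tuples per second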

A second problem is that, if the earliest need-to-start time is in the
past, then we definitely are in trouble and had better get to work at
once, but if it's in the future, that doesn't necessarily mean we're
safe. If there are three tables with a need-to-finish time that is 12
hours in the future and each of them will take 11 hours to vacuum,
then every need-to-start time computed according to the algorithm
above is in the future, but in fact we're in a lot of trouble. If the
estimates are accurate, we need 3 autovacuum workers to be available
to start within 1 hour, or we're doomed. The very last thing we want
to do is wait another hour before doing anything. It's not impossible
to factor this into the calculation of need-to-start times, assuming
we know how many workers we have. For instance, if we've got tables
whose need-to-finish times are 30, 50, and 70 minutes in the future,
we can see that if each one takes 20 minutes or less to vacuum, then
the need-to-start times can just be computed by subtraction. But if the
tables with 50 or 70 minute deadlines are going to take more than 20
minutes to vacuum, then we've got to back up the need-to-start times
so that we finish each table in time to start on the next one. I
haven't looked into what algorithms exist for this kind of scheduling
problem, but I feel like a literature search, or pulling out my
college textbooks, would probably turn up some useful ideas.
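
For the single-worker case, at least, the adjustment is just a
backward pass over the tables in deadline order, something like this
(a sketch only; handling multiple workers properly is where the
literature would come in):

    # Walk the tables from the latest deadline to the earliest, pulling
    # each need-to-start time back far enough that the job can still
    # finish before the next one has to begin.  Times are minutes from
    # now; one worker is assumed.
    def adjusted_start_times(tables):
        tables = sorted(tables, key=lambda t: t["need_to_finish"])
        starts = [None] * len(tables)
        latest_finish = float("inf")
        for i in reversed(range(len(tables))):
            finish = min(tables[i]["need_to_finish"], latest_finish)
            starts[i] = finish - tables[i]["duration"]
            latest_finish = starts[i]   # the previous job must end by now
        return starts

    # Deadlines of 30, 50, and 70 minutes, but the later tables take 25
    # minutes each: naive subtraction says we could wait 10 minutes to
    # start the first table, but the backward pass says we must start now.
    tables = [{"need_to_finish": 30, "duration": 20},
              {"need_to_finish": 50, "duration": 25},
              {"need_to_finish": 70, "duration": 25}]
    print(adjusted_start_times(tables))   # -> [0, 20, 45]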

As soon as you see that you can't decide when you need to start on a
particular table without knowing what's going on with all the other
tables on the system, a third problem becomes apparent: you can't
figure anything out with confidence by looking at a single database,
but must rather gather information from all databases and decide to
which databases the workers should connect and what they ought to do
when they get there. And then, as tasks finish and system activity
progresses, you need to continuously update your notion of what needs
to be done next and move workers around to accommodate it. This gets
quite complicated, but that doesn't mean that it's unimportant.
There's a pretty well-known "thundering herd" type effect when every
table in the system crosses autovacuum_freeze_max_age around the same
time, and suddenly we go from not much vacuuming to a whole ton of
vacuuming all at once. A system like this could give us enough
information to spread that out in an intelligent way: we could see the
storm coming and start a single autovacuum worker working on the
problem well in advance, and then ramp up to multiple workers only if
it looks like that's not going to be enough to get the job done. I'm
not sure I want to deal with all that complexity on day one, but I
think it's important to do something about it eventually.
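
In principle the same machinery could drive the ramp-up decision:
check whether the projected work fits with one worker, and add
workers only until the deadlines all come out feasible. Very roughly
(again just a sketch; the greedy feasibility check is a heuristic,
not a real scheduler):

    import heapq

    # Greedy feasibility check: take tables in deadline order and always
    # hand the next one to the worker that frees up first.
    def deadlines_met(tables, nworkers):
        free_at = [0] * nworkers            # when each worker is next idle
        heapq.heapify(free_at)
        for t in sorted(tables, key=lambda t: t["need_to_finish"]):
            start = heapq.heappop(free_at)  # earliest-available worker
            finish = start + t["duration"]
            if finish > t["need_to_finish"]:
                return False
            heapq.heappush(free_at, finish)
        return True

    def workers_needed(tables):
        n = 1
        while n < len(tables) and not deadlines_met(tables, n):
            n += 1
        return n   # beyond one worker per table, more can't help anyway

    # Three tables all due in 12 hours, each taking 11 hours to vacuum
    # (the example above), need three workers starting more or less now.
    tables = [{"need_to_finish": 12, "duration": 11}] * 3
    print(workers_needed(tables))   # -> 3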

In the meantime, I think a sensible place to start would be to figure
out some system that makes sensible estimates of how soon we need to
address bloat, XID wraparound, and MXID wraparound for each table, and
some system that estimates how long each one will take to vacuum.
Then, even if the workers aren't talking to each other, each
individual worker can make an effort to deal with the most urgent
tasks first. I think that estimating how long it will take to vacuum a
table shouldn't be too bad: examining the visibility map and the index
sizes and thinking about the autovacuum cost limits should give us at
least some notion of what's likely to happen. Also, I think estimating
the time until XID age becomes critical isn't too bad either. First,
fix a threshold (perhaps autovacuum_freeze_max_age, maybe with a
higher-than-current value, or maybe some other threshold entirely)
that represents the target below which we always want to remain.
Second, know something about how fast the system is consuming XIDs.
Then just divide. I thought for a while that it would be too hard to
understand the XID consumption rate, because it might be very uneven,
but I think we can mitigate that problem somewhat by averaging over
relatively long time intervals. For instance, we might measure the
number of XIDs consumed in each 24 hour period, keep 7 days of data,
and then take the average of those values, or maybe better, the
highest or second-highest. That's a very small amount of data to store
and in combination with the relfrozenxid for each table, it's all we
need. We could also give the user a way to override our estimate of
the XID consumption rate, in case they have very brief, very severe
spikes. All of this can also be done for MXID age. It's estimating
the time at which table bloat will exceed some threshold that
seems most difficult, because that seems to require measuring trends
on a per-table basis, as opposed to the XID consumption rate, which is
global.
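
Concretely, the XID part of that is about this much arithmetic (a
sketch; the 7-day window and picking the second-highest day rather
than the mean are just the ideas from above, and all the numbers are
illustrative):

    # daily_xids holds the last seven days' worth of "XIDs consumed per
    # 24 hours".  Using the second-highest day rather than the average
    # keeps one busy day from being washed out.  The same arithmetic
    # works for MXIDs.
    def xid_deadline_days(table_xid_age, daily_xids,
                          threshold=200_000_000):  # autovacuum_freeze_max_age-ish
        rate = sorted(daily_xids)[-2]              # second-highest daily rate
        return (threshold - table_xid_age) / rate  # days until we cross it

    # A table whose relfrozenxid is 150 million XIDs old, on a system that
    # burns about 10 million XIDs on its busier days, has roughly 5 days
    # before it crosses a 200-million threshold.
    week = [6e6, 8e6, 11e6, 9e6, 7e6, 10e6, 5e6]
    print(xid_deadline_days(150_000_000, week))    # -> 5.0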

I know that this email is kind of a giant wall of text, so my thanks
if you've read this far, and even more if you feel inspired to write
back with your own thoughts.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
