Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date: 2023-01-14 05:55:31
Message-ID: 20230114055531.tpwfzj2s6fsk6g53@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-01-13 19:39:41 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 6:09 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I don't think the split is right. There's too much in 0001 - it's basically
> > introducing the terminology of 0002 already. Could you make it a much more
> > minimal change?
>
> Okay.
>
> I thought that you might say that. I really just wanted to show you
> how small the code footprint was for the more controversial part.

I don't mind you splitting this into three parts ;)

> I believe that we agree on most things when it comes to VACUUM, but
> I'm pretty sure that we still disagree about the value in setting
> autovacuum_freeze_max_age very high. I continue to believe that
> setting it over a billion or so is just a bad idea. I'm mentioning
> this only because it might give you some idea of where I'm coming from
> -- in general I believe that age-based settings often have only very
> weak relationships with what actually matters.

I think part of our difference around a high autovacuum_freeze_max_age is due
to things you're trying to address here - if no-auto-cancel is a separate
threshold from autovacuum_freeze_max_age, it's less problematic to set
autovacuum_freeze_max_age to something lower.

But yes, there's a remaining difference of opinion / experience.

> It might make sense to always give a small fixed amount of headroom
> when autovacuum_freeze_max_age is set to very high values. Maybe just
> 5 million XIDs/MXIDs. That would probably be just as effective as
> (say) 500 million in almost all cases. But then you have to accept
> another magic number.

I suspect that most systems with a high autovacuum_freeze_max_age use it
because 200m leads to too frequent autovacuums - a few million won't do much
for those.

How about a float autovacuum_no_auto_cancel_age where positive values are
treated as absolute values, and negative values are a multiple of
autovacuum_freeze_max_age? And where the "computed" age is capped at
vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.

And maybe a default setting of -1.8 or so?

If a user chooses to set autovacuum_no_auto_cancel_age and vacuum_failsafe_age
to 2.1 billion, oh well, that's really not our problem.
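
Just to illustrate, a minimal sketch of how the effective threshold could be
computed under that scheme (autovacuum_no_auto_cancel_age is of course an
assumed, not-yet-existing GUC; autovacuum_freeze_max_age and
vacuum_failsafe_age exist today):

#include "postgres.h"
#include "commands/vacuum.h"		/* vacuum_failsafe_age */
#include "postmaster/autovacuum.h"	/* autovacuum_freeze_max_age */

/* assumed new float GUC, with a default along the lines of -1.8 */
static double autovacuum_no_auto_cancel_age = -1.8;

static TransactionId
effective_no_auto_cancel_age(void)
{
	double		age;

	/*
	 * Positive values are absolute ages, negative values are a multiple of
	 * autovacuum_freeze_max_age.
	 */
	if (autovacuum_no_auto_cancel_age >= 0)
		age = autovacuum_no_auto_cancel_age;
	else
		age = -autovacuum_no_auto_cancel_age * autovacuum_freeze_max_age;

	/* cap at vacuum_failsafe_age - a failsafe vacuum shouldn't be cancellable */
	age = Min(age, (double) vacuum_failsafe_age);

	return (TransactionId) age;
}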

> > 1) It regularly scares the crap out of users, even though it's normal. This
> > is further confounded by failsafe autovacuums, where a scared reaction is
> > appropriate, not being visible in pg_stat_activity.
>
> The docs actually imply that when the system reaches the point of
> entering xidStopLimit mode, you might get data corruption. Of course
> that's not true (the entire point of xidStopLimit is to avoid that),
> but apparently we like to keep users on their toes.

Well, historically there wasn't all that much protection. And I suspect there
still might be some dragons. We really need to get rid of the remaining places
that cache 32bit xids across transactions.

> > 3) Autovacuums triggered by tuple thresholds persistently getting cancelled
> > also regularly causes outages, and they make it more likely that an
> > eventual age-based vacuum will take forever.
>
> Really? Outages? I imagine that you'd have to be constantly hammering
> the table with DDL before it could happen. That's possible, but it
> seems relatively obvious that doing that is asking for trouble.

Yes, due to queries slowing down due to the bloat, or due to running out of
space.

I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
even DDL, they just needed to make the table readonly, for a brief moment, a
few times a day. The problem started once the indexes grew large enough that
the time for (re-)finding dead tuples plus the index scan phase got long
enough that the workers were unlucky enough to be killed a few times in a row. After that
autovac never got through the index scan phase. Not doing any "new" work, just
collecting the same dead tids over and over, scanning the indexes for those
tids, never quite finishing.

You really don't need lock conflicts to be all that frequent. They just need to
happen more often than the index scan portion of vacuuming can complete, which
can take a long time with large indexes.

> > Aspect 1) is addressed to a good degree by the proposed split of anti-wrap
> > into an age and anti-cancel triggers. And could be improved by reporting
> > failsafe autovacuums in pg_stat_activity.
>
> What you call aspect 2 (the issue with disastrous HW lock traffic jams
> involving TRUNCATE being run from a cron job, etc) is a big goal of
> mine for this patch series. You seem unsure of how effective my
> approach (or an equally simple approach based on table age heuristics)
> will be. Is that the case?

I was primarily concerned about situations where an admin was interactively
trying to get rid of the table holding back the xid horizon, after some
problem caused a lot of work to pile up.

If you know postgres internals, it's easy. You know that autovac stops
auto-cancelling after freeze_max_age and you know that the lock queue is
ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
afterwards cancel the autovac worker. But if you don't, it'll take a while to
figure out that the DROP TABLE isn't progressing due to a lock conflict, at
which point you'll cancel the statement (possibly having wreaked havoc on
all other accesses). Then you figure out that you need to cancel
autovacuum. After that you try the DROP TABLE again - but the next worker has
already gotten to work on the table.

Brr.

The bad cousin of this is when you can't even drop the table due to "not
accepting commands". I don't know if you've seen people try to start postgres
in single user mode, under pressure, for the first time. It ain't pretty.

> > I have two ideas around 3):
> >
> > First, we could introduce thresholds for the tuple thresholds, after which
> > autovacuum isn't cancellable anymore.
>
> Do you think that's a good idea? I just don't trust those statistics,
> at all. As in, I think they're often complete garbage.

I have seen them be reasonably accurate in plenty of busy systems. The most
common problem I've seen is that the amount of assumed dead / inserted tuples
is way too high, because the server crash-restarted at some point. We now
have most of the infrastructure for using a slightly older version of stats
after a crash, which'd make them less inaccurate.

I haven't recently seen crazy over-estimates of the number of dead tuples, at
least if you count dead tids as tuples. It might be worth splitting the dead
tuple count into dead tuples and dead items.

Are you mostly seeing over- or under-estimates?

> > Second, we could track the number of cancellations since the last [auto]vacuum
> > in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been
> > cancelled a number of times.
>
> In theory that would be better than an approach along the lines I've
> proposed, because it is directly based on a less aggressive approach
> being tried a few times, and failing a few times. That part I like.
>
> However, I also don't like several things about this approach. First
> of all it relies on relatively complicated infrastructure, for
> something that can be critical.

I don't think it relies on much more machinery than we already rely on? The
dead/inserted tuple estimates already come from pgstat. The only really
new piece would be that the worker would need to do a
pgstat_report_autovac_failure() after getting cancelled, which doesn't seem
too hard?
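
Roughly along the lines of the existing pgstat_report_autovac(), e.g. (the
function and the counter field are of course just assumed here, neither
exists yet):

#include "postgres.h"
#include "miscadmin.h"
#include "utils/pgstat_internal.h"

void
pgstat_report_autovac_failure(Oid relid, bool shared)
{
	PgStat_EntryRef *entry_ref;
	PgStatShared_Relation *shtabentry;

	/* lock the table's shared stats entry and bump the (assumed) counter */
	entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION,
											shared ? InvalidOid : MyDatabaseId,
											relid, false);

	shtabentry = (PgStatShared_Relation *) entry_ref->shared_stats;
	shtabentry->stats.autovac_cancel_count++;	/* assumed new field */

	pgstat_unlock_entry(entry_ref);
}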

> Second, it will be hard to test.

It doesn't seem too bad. A short naptime, cancelling autovac in a loop, and
the threshold should quickly be reached? We also could add a helper function
to artificially increase the failure count.

> Third, perhaps it would make sense to give the earlier/less aggressive
> approach (a table age av that is still autocancellable) quite a large
> number of attempts before giving up. If the table age isn't really
> growing too fast, why not continue to be patient, possibly for quite a
> long time?

Yea, that's true. But it's also easy to get to the point that you collect so
much "debt" that it'll be hard to get out of. Particularly once the dead items
space doesn't fit all the dead tids, the subsequent parts of the table won't
get processed by vacuum if it frequently is cancelled before the index scans
end, leading to a huge amount of work later.

> Perhaps a hybrid strategy could be useful? Something like what I came
> up with already, *plus* a mechanism that gives up after (say) 1000
> cancellations, and escalates to no-auto-cancel, regardless of table
> age. It seems sensible to assume that a less aggressive approach is
> just hopeless relatively quickly (in wall clock time and logical XID
> time) once we see sufficiently many cancellations against the same
> table.

Maybe. Making it explainable is presumably the hard part. We've historically
failed to make this area understandable, so maybe we don't need to try :)
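
If we went that way, the decision could be something as simple as this sketch
(reusing the effective_no_auto_cancel_age() sketch from above; the names, the
pgstat-tracked cancellation count and the 1000 threshold are all placeholders):

#include "postgres.h"
#include "access/transam.h"		/* ReadNextTransactionId() */

#define AV_MAX_CANCELS	1000	/* placeholder escalation threshold */

static bool
autovacuum_is_auto_cancellable(TransactionId relfrozenxid, int64 ncancels)
{
	TransactionId xid_age;

	if (!TransactionIdIsNormal(relfrozenxid))
		return true;

	xid_age = ReadNextTransactionId() - relfrozenxid;

	/* escalate once the table's age crosses the no-auto-cancel threshold */
	if (xid_age > effective_no_auto_cancel_age())
		return false;

	/* ... or once the less aggressive approach has failed often enough */
	if (ncancels >= AV_MAX_CANCELS)
		return false;

	return true;
}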

Somehow this made me think of a somewhat crazy, and largely unrelated, idea:
Why don't we use the currently unused VM bit combination to indicate pages
with dead tids? We could have an [auto]vacuum mode where it scans just pages
with the dead tids bit set. Particularly when on-access heap pruning is doing
most of the work, that could be quite useful to more cheaply get rid of the
dead tids. Obviously we'd also set them when vacuum decides / was told not to
do index cleanup. Yes, it'd obviously be less effective than lots of the
things we discussed in this area (needing to re-collect the dead tids on the
indicated pages), but it'd have the advantage of not needing a lot of new
infrastructure.
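
Purely to illustrate reading the bits, something like the below (nothing
assigns this meaning to the ALL_FROZEN-without-ALL_VISIBLE combination today):

#include "postgres.h"
#include "access/visibilitymap.h"

static bool
vm_page_has_dead_items(Relation rel, BlockNumber blkno, Buffer *vmbuf)
{
	uint8		status = visibilitymap_get_status(rel, blkno, vmbuf);

	/* ALL_FROZEN set while ALL_VISIBLE is clear: the currently unused combination */
	return (status & VISIBILITYMAP_ALL_FROZEN) != 0 &&
		(status & VISIBILITYMAP_ALL_VISIBLE) == 0;
}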

Greetings,

Andres Freund
