Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date: 2023-01-12 22:12:31
Message-ID: CAH2-Wz=sBf1nU6uspvFybgFBy7mG48wCbLnPJKxgRV2j1ZekJw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 12, 2023 at 1:08 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I doubt it. Wiggle room that's based on the XID threshold being
> different for one behavior vs. another can easily fail to produce any
> benefit, because there's no guarantee that the autovacuum launcher
> will ever try to launch a worker against that table while the XID is
> in the range where you'd get one behavior and not the other.

Of course it's true that in general it might not succeed in
forestalling the auto cancellation behavior. You can say something
similar about approximately anything like this. For example, there is
no absolute guarantee that any autovacuum will ever complete. But we
still try!

> I've long thought that the fact that vacuum_freeze_table_age is documented
> as capped at 0.95 * autovacuum_freeze_max_age is silly for just this
> reason. The interval that you're proposing is much wider so the
> chances of getting a benefit are greater, but supposing that it's
> going to solve it in most cases seems like an exercise in unwarranted
> optimism.

I don't claim to be dealing in certainties, especially about the final
outcome. Whether or not you accept my precise claim is perhaps not
important, in the end. What is important is that we give things a
chance to succeed, based on the information that we have available,
with a constant eye towards avoiding disaster scenarios.
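
For reference, the cap mentioned in the quoted text can be sketched as a bit of arithmetic. This is only an illustration, not the server's code: the function names are made up for this example, the default values (150 million and 200 million) are the documented defaults for the two GUCs, and integer arithmetic stands in for the 0.95 factor.

```python
# Documented defaults for the two GUCs; change these to match your settings.
VACUUM_FREEZE_TABLE_AGE = 150_000_000
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000


def effective_freeze_table_age(freeze_table_age: int, freeze_max_age: int) -> int:
    """Cap freeze_table_age at 95% of freeze_max_age (hypothetical helper)."""
    cap = freeze_max_age * 95 // 100  # integer form of 0.95 * freeze_max_age
    return min(freeze_table_age, cap)


def window_before_max_age(freeze_table_age: int, freeze_max_age: int) -> int:
    """XIDs between the effective freeze_table_age and freeze_max_age.

    This is the range of table ages in which one behavior applies but the
    other does not yet (hypothetical helper, for illustration only).
    """
    return freeze_max_age - effective_freeze_table_age(freeze_table_age, freeze_max_age)


if __name__ == "__main__":
    print(effective_freeze_table_age(VACUUM_FREEZE_TABLE_AGE, AUTOVACUUM_FREEZE_MAX_AGE))
    print(window_before_max_age(VACUUM_FREEZE_TABLE_AGE, AUTOVACUUM_FREEZE_MAX_AGE))
```

With a setting pushed right up against autovacuum_freeze_max_age, the cap leaves a window of only 5% of autovacuum_freeze_max_age, which is the narrow interval being criticized in the quoted text.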

Some of the problems with VACUUM seem to be cases where VACUUM takes
on a potentially ruinous obligation that it cannot possibly meet in
rare cases that nonetheless do come up -- like the cleanup lock
behavior. Is a check for $1000 written by me really worth less than a
check written by me for a billion dollars? They're both nominally
equivalent guarantees about an outcome, after all, though one has a
far greater monetary value. Which would you value more, subjectively?

Nothing is guaranteed -- even (and perhaps especially) strong guarantees.

> In fact, I would guess that it will very rarely solve the
> problem. Normally, the XID age of a table never reaches
> autovacuum_freeze_max_age in the first place. If it does, there's some
> reason.

Probably, but none of this matters at all if the table age never
reaches autovacuum_freeze_max_age in the first place. We're only
talking about tables where that isn't the case, by definition.
Everything else is out of scope here.

> Maybe there's a really old open transaction or an abandoned
> replication slot or an unresolved 2PC transaction. Maybe the
> autovacuum system is overloaded and no table is getting visited
> regularly because the system just can't keep up. Or maybe there are
> regular AELs being taken on the table at issue.

Maybe an asteroid hits the datacenter, making all of these
considerations irrelevant. But perhaps it won't!

> If there's only an AEL
> taken against a table once in a blue moon, some autovacuum attempt ought
> to succeed before we reach autovacuum_freeze_max_age. Flipping that
> around, if we reach autovacuum_freeze_max_age without advancing
> relfrozenxid, and an AEL shows up behind us in the lock queue, it's
> really likely that the reason *why* we've reached
> autovacuum_freeze_max_age is that this same thing has happened to
> every previous autovacuum attempt and they all cancelled themselves.

Why do you assume that a previous autovacuum ever got launched in the
first place? There is always going to be a certain kind of table that
can only get an autovacuum when its table age crosses
autovacuum_freeze_max_age. And it's not just static tables -- there is
very good reason to have doubts about the statistics that drive
autovacuum. Plus vacuum_freeze_table_age works very unreliably (which
is why my big VACUUM patch more or less relegates it to a
compatibility option, while retaining a more sophisticated notion of
table age creating pressure to advance relfrozenxid).

Under the scheme from this autovacuum patch, it really does become
reasonable to make a working assumption that there was a previous
autovacuum that failed (likely due to the autocancellation behavior,
as you said). We must have tried and failed in an earlier autovacuum,
once we reach the point of needing an antiwraparound autovacuum
(meaning a table age autovacuum which cannot be autocancelled) --
which is not the case today at all. If nothing else, table age
autovacuums will have been scheduled much earlier on -- they will have
at least started up, barring pathological cases.
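
The distinction can be sketched roughly as follows. This is an illustration only, not code from the patch: the function name and both threshold parameters are hypothetical, and the only thing assumed is what is described above, namely that a cancellable table age autovacuum is scheduled at some earlier point than the non-cancellable antiwraparound autovacuum.

```python
from typing import Optional, Tuple


def classify_autovacuum(
    table_age: int,
    table_age_threshold: int,        # hypothetical: where table age autovacuums start
    antiwraparound_threshold: int,   # hypothetical: where autocancellation is disabled
) -> Tuple[Optional[str], Optional[bool]]:
    """Return (kind, cancellable) for an autovacuum triggered by table age.

    kind is None when table age alone would not trigger an autovacuum.
    """
    if table_age_threshold >= antiwraparound_threshold:
        raise ValueError("table age autovacuums must be scheduled earlier")
    if table_age >= antiwraparound_threshold:
        # Last resort: earlier, cancellable attempts are assumed to have failed.
        return ("antiwraparound", False)
    if table_age >= table_age_threshold:
        # Ordinary autovacuum; it yields to conflicting lock requests.
        return ("table age", True)
    return (None, None)
```

The point of the ordering is that, by the time the second branch is no longer taken, at least one cancellable attempt should already have been made.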

That's a huge difference in the strength of the signal, compared to today.

The super aggressive autocancellation behavior is actually
proportionate to the problem at hand. Kind of like how if you go to
the doctor and tell them you have a headache, they don't schedule you
for emergency brain surgery. What they do is tell you to take an
aspirin, and make sure that you stay well hydrated -- if the problem
doesn't go away after a few days, then call back, reassess. Perhaps it
really will be a brain tumor, but there is nothing to gain and
everything to lose by taking such drastic action at the first sign of
trouble.

> If we cancel ourselves too, we're just postponing resolution of the
> problem to some future point when we decide to stop cancelling
> ourselves. That's not a win.

It's also only a very minor loss, relative to what would have happened
without any of this. This is something that we can be relatively sure
of (unlike anything about final outcomes). It's clear that we have a
lot to gain. What do we have to lose, really?

> > I think that users will really appreciate having only one kind of
> > VACUUM/autovacuum (since the other patch gets rid of discrete
> > aggressive mode VACUUMs). I want "table age autovacuuming" (as I
> > propose to call it) to come to be seen as no different from any other
> > autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
> > autovacuum. The difference is only in how autovacuum.c triggers the
> > VACUUM, not in any runtime behavior. That's an important goal here.
>
> I don't agree with that goal. I think that having different kinds of
> autovacuums with different, identifiable names and corresponding,
> easily-identifiable behaviors is really important for troubleshooting.

You need to distinguish between different types of autovacuums and
different types of VACUUMs here. Sure, it's valuable to have
information about why autovacuum launched a VACUUM, and the patch
greatly improves that. But runtime behavior is another story.

It's not really generic behavior -- more like generic policies that
produce different behavior under different runtime conditions. VACUUM
has always had generic policies about how to do things, at least up
until the introduction of the visibility map, which added
scan_all/aggressive VACUUMs, and the vacuum_freeze_table_age GUC. The
policy should be the same in every VACUUM; the behavior itself emerges
from that policy.

> Trying to remove those distinctions and make everything look the same
> will not keep autovacuum from getting itself into trouble. It will
> just make it harder to understand what's happening when it does.

The point isn't to have every VACUUM behave in the same way. The point
is to make decisions dynamically, based on the observed conditions in
the table. And to delay committing to things until there really is no
alternative, to maximize our opportunities to avoid disaster. In
short: loose, springy behavior.

Imposing absolute obligations on VACUUM has the potential to create
lots of problems. It is sometimes necessary, but can easily be
overused, making a bad situation much worse.

--
Peter Geoghegan
