Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date: 2023-01-16 21:58:21
Message-ID: CAH2-Wzki1bPfm0yx8oXD1r407AA3dUsKsM6UEex5Cu=Ngb9Y2Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think part of our difference around a high autovacuum_freeze_max_age are due
> to things you're trying to address here - if no-auto-cancel is a separate
> threshold from autovacuum_freeze_max_age, it's less problematic to set
> autovacuum_freeze_max_age to something lower.

That seems like a much smaller difference of opinion than I'd imagined
it was before now.

> But yes, there's a remaining difference of opinion / experience.

Perhaps it's also a communication issue. I don't disagree with
pragmatic decisions that made sense given very particular limitations
in Postgres, and that's what this is starting to sound like now.

When I express skepticism of very high autovacuum_freeze_max_age
settings, it's mostly just that I don't think that age(relfrozenxid)
is at all informative in the way that something that triggers
autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
less informative it becomes. I'm sure that it can be safe to use very
high autovacuum_freeze_max_age values, but it seems like a case of
using a very complicated and unreliable thing to decide when to
VACUUM.

> > It might make sense to always give a small fixed amount of headroom
> > when autovacuum_freeze_max_age is set to very high values. Maybe just
> > 5 million XIDs/MXIDs. That would probably be just as effective as
> > (say) 500 million in almost all cases. But then you have to accept
> > another magic number.
>
> I suspect that most systems with a high autovacuum_freeze_max_age use it
> because 200m leads to too frequent autovacuums - a few million won't do much
> for those.

My point was mostly that it really doesn't cost us anything, and it
could easily help, so it might be worthwhile.

I wonder if these users specifically get too many aggressive
autovacuums with lower autovacuum_freeze_max_age settings? Maybe they
even fail to get enough non-aggressive autovacuums? If that's what it
is, then that makes perfect sense to me. However, I tend to think of
this as an argument against aggressive VACUUMs, not an argument in
favor of high autovacuum_freeze_max_age settings (as I said,
communication is hard).

> How about a float autovacuum_no_auto_cancel_age where positive values are
> treated as absolute values, and negative values are a multiple of
> autovacuum_freeze_max_age? And where the "computed" age is capped at
> vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.
>
> And maybe a default setting of -1.8 or so?

I think that would work fine, as a GUC (with no reloption).
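
Just to make sure we're imagining the same semantics, here is a rough
standalone sketch of how I read that proposal (none of this is real
code; autovacuum_no_auto_cancel_age is your hypothetical GUC, and the
defaults are only there for illustration):

#include <stdint.h>

typedef uint32_t TransactionId;

/* defaults shown only for illustration */
static int autovacuum_freeze_max_age = 200000000;
static int vacuum_failsafe_age = 1600000000;
static double autovacuum_no_auto_cancel_age = -1.8;

/*
 * Positive values are absolute ages, negative values are a multiple
 * of autovacuum_freeze_max_age, and the result is capped at
 * vacuum_failsafe_age, so a failsafe autovacuum is never cancelled.
 */
static TransactionId
no_auto_cancel_age(void)
{
    double      age;

    if (autovacuum_no_auto_cancel_age >= 0)
        age = autovacuum_no_auto_cancel_age;
    else
        age = -autovacuum_no_auto_cancel_age *
            (double) autovacuum_freeze_max_age;

    if (age > (double) vacuum_failsafe_age)
        age = (double) vacuum_failsafe_age;

    return (TransactionId) age;
}

With a default of -1.8 and autovacuum_freeze_max_age at 200 million,
that works out to a no-auto-cancel threshold of 360 million XIDs,
which seems sane to me.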

> Well, historically there wasn't all that much protection. And I suspect there
> still might be some dragons. We really need to get rid of the remaining places
> that cache 32bit xids across transactions.

Maybe, but our position is that it's supported, it's safe -- there
will be no data loss (or else it's a bug, and one that we'd take very
seriously at that). Obviously it's a bad idea to allow this to happen,
but surely having the system enter xidStopLimit is sufficient
disincentive for users.

> I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
> server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
> even DDL, they just needed to make the table readonly, for a brief moment, a
> few times a day. The problem started once the indexes grew to be large enough
> that the time for (re-)finding dead tuples plus an index scan phase got large
> enough that they were unlucky to be killed a few times in a row. After that
> autovac never got through the index scan phase. Not doing any "new" work, just
> collecting the same dead tids over and over, scanning the indexes for those
> tids, never quite finishing.

That makes sense to me, though I wonder if there would have been
another kind of outage (not the one you actually saw) had the
autocancellation behavior somehow been disabled after the first few
cancellations.

This sounds like a great argument in favor of suspend-and-resume as a
way of handling autocancellation -- no useful work needs to be thrown
away for AV to yield for a minute or two. One ambition of mine for the
visibility map snapshot infrastructure was to be able to support
suspend-and-resume. It wouldn't be that hard for autovacuum to
serialize everything, and use that to pick up right where an
autovacuum worker left off when it was autocancelled. Same
OldestXmin starting point as before, same dead_items array, same
number of scanned_pages (and pages left to scan).
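
To make that a little more concrete, the serialized state probably
wouldn't need to be much more than something like this (purely a
sketch of the shape of it, with made-up names; the real thing would
need a lot more care):

#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;

/*
 * Sketch of a suspended-autovacuum snapshot, written to a temp file
 * when a worker yields. A later worker picks the table back up with
 * the same cutoffs and dead_items, instead of starting from scratch.
 */
typedef struct SuspendedVacuumState
{
    Oid             relid;          /* table being vacuumed */
    TransactionId   OldestXmin;     /* same cutoff as the suspended run */
    BlockNumber     scanned_pages;  /* heap pages already scanned */
    BlockNumber     next_block;     /* where to resume the heap scan */
    uint64_t        num_dead_items; /* number of serialized dead TIDs */
    /* the dead_items TID array follows this header in the temp file */
} SuspendedVacuumState;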

> > What you call aspect 2 (the issue with disastrous HW lock traffic jams
> > involving TRUNCATE being run from a cron job, etc) is a big goal of
> > mine for this patch series. You seem unsure of how effective my
> > approach (or an equally simple approach based on table age heuristics)
> > will be. Is that the case?
>
> I was primarily concerned about situations where an admin was interactively
> trying to get rid of the table holding back the xid horizon, after some
> problem caused a lot of work to pile up.

Fair enough, but the outages that I'm mostly thinking about here
weren't really like that. There wasn't anything holding back the
horizon, at any point. It was just that the autocancellation behavior
was disruptive in some critical way.

> If you know postgres internals, it's easy. You know that autovac stops
> auto-cancelling after freeze_max_age and you know that the lock queue is
> ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
> afterwards cancel the autovac worker. But if you aren't, it'll take a while to
> figure out that the DROP TABLE isn't progressing due to a lock conflict, at
> which point you'll cancel the statement (possibly having wreaked havoc with
> all other accesses). Then you figure out that you need to cancel
> autovacuum. After you try the DROP TABLE again - but the next worker has
> gotten to work on the table.

Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
special cases? Maybe they should always be able to auto cancel an
autovacuum?
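
In other words, something along these lines, sketching only the policy
and not how it would actually hook into the lock conflict path (the
command tag strings are just illustrative):

#include <stdbool.h>
#include <string.h>

/*
 * Sketch: even once a table's age has crossed the no-auto-cancel
 * threshold, let a small set of DDL statements cancel the autovacuum
 * worker anyway. How the blocked statement's command tag reaches this
 * point is hand-waved here.
 */
static bool
blocker_may_cancel_autovacuum(const char *commandTag,
                              bool past_no_cancel_age)
{
    if (!past_no_cancel_age)
        return true;        /* ordinary auto-cancellation applies */

    /* DDL that should never sit behind an autovacuum worker */
    if (strcmp(commandTag, "DROP TABLE") == 0 ||
        strcmp(commandTag, "TRUNCATE TABLE") == 0)
        return true;

    return false;           /* table age wins; don't auto-cancel */
}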

> > > I have two ideas around 3):
> > >
> > > First, we could introduce thresholds for the tuple thresholds, after which
> > > autovacuum isn't cancellable anymore.
> >
> > Do you think that's a good idea? I just don't trust those statistics,
> > at all. As in, I think they're often complete garbage.
>
> I have seen them be reasonably accurate in plenty of busy systems. The most
> common problem I've seen is that the amount of assumed dead / inserted tuples
> is way too high, because the server crash-restarted at some point. We now
> have most of the infrastructure for using a slightly older version of stats
> after a crash, which'd make them less inaccurate.

Did you mean way too low?

> I haven't recently seen crazy over-estimates of the number of dead tuples, at
> least if you count dead tids as tuples. It might be worth splitting the dead
> tuple count into dead tuples and dead items.
>
> Are you mostly seeing over- or under-estimates?

It's not so much what I've seen. It's that the actual approach has
lots of problems.

Referring to my notes, here are what seem to me to be serious problems:

* We are very naive about what a dead tuple even means, and we totally
fail to account for the fact that only the subset of heap pages that
are not PageIsAllVisible() are interesting to VACUUM -- focusing on the
whole table just seems wrong.

Per https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com

* Stub LP_DEAD items in heap pages are extremely different from dead
heap-only tuples in heap pages, a distinction that we currently ignore.

Per https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com

* Insert-driven autovacuums (AVs triggered by
autovacuum_vacuum_insert_threshold/autovacuum_vacuum_insert_scale_factor)
become further spaced apart as a consequence of one VACUUM operation
taking far longer than usual (generally because it's an aggressive
VACUUM that follows several non-aggressive VACUUMs).

Per https://postgr.es/m/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g@mail.gmail.com

It's quite possible to get approximately the desired outcome with an
algorithm that is completely wrong -- the way that we sometimes need
autovacuum_freeze_max_age to deal with bloat is a great example of
that. Even then, there may still be serious problems that are well
worth solving.

> > However, I also don't like several things about this approach. First
> > of all it relies on relatively complicated infrastructure, for
> > something that can be critical.
>
> I don't think it relies on much more machinery than we already rely on? The
> dead/inserted tuple estimates come from pgstat already. The only really
> new piece would be that the worker would need to do a
> pgstat_report_autovac_failure() after getting cancelled, which doesn't seem
> too hard?

Yeah, but it is a new dependency on stored state. Certainly doable,
but hard enough that it might be better to add that part later on.
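
To be clear about what I mean by stored state, I'm imagining roughly
the following shape (all names hypothetical, including the function
name, which I'm just taking from your email):

#include <stdint.h>
#include <time.h>

typedef uint32_t Oid;

/*
 * Hypothetical per-table pgstat fields: a worker bumps the counter
 * from its cancellation error path, and a later worker (or the
 * launcher) consults it when deciding whether to allow cancellation.
 */
typedef struct PgStat_AutovacFailure
{
    Oid         relid;
    int64_t     autovac_cancel_count;       /* times auto-cancelled */
    time_t      last_autovac_cancel_time;   /* when it last happened */
} PgStat_AutovacFailure;

/* what a cancelled worker would call (declaration only, as a sketch) */
extern void pgstat_report_autovac_failure(Oid relid);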

> Somehow this made me think of a somewhat crazy, and largely unrelated, idea:
> Why don't we use the currently unused VM bit combination to indicate pages
> with dead tids? We could have an [auto]vacuum mode where it scans just pages
> with the dead tids bit set. Particularly when on-access heap pruning is doing
> most of the work, that could be quite useful to more cheaply get rid of the
> dead tids. Obviously we'd also set them when vacuum decides / was told not to
> do index cleanup. Yes, it'd obviously be less effective than lots of the
> things we discussed in this area (needing to re-collect the dead tids on the
> indicated pages), but it'd have the advantage of not needing a lot of new
> infrastructure.

I wonder if it would be possible to split up the work of VACUUM into
multiple phases that can be processed independently. The dead_items
array could be serialized and stored in a temp file. That's not the
same as some of the more complicated stuff we talked about in the last
couple of years, such as a dedicated fork for Dead TIDs. It's more
like an extremely flexible version of the same basic design for
VACUUM, with the ability to slow down and speed up based on a system
level view of things (e.g., checkpointing information). And with index
vacuuming happening on a highly deferred timeline in some cases.
Possibly we could allow each slice of work to be processed by any
available autovacuum worker. Autovacuum workers could become "headless".

You would need some kind of state machine to make sure that critical
dependencies were respected (e.g., always do the heap vacuuming step
after all indexes are vacuumed), but that possibly isn't that hard,
and still gives you a lot.
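
Sketching the state machine part very roughly (all names hypothetical,
and with nothing said about how work items would actually be stored or
scheduled):

#include <stdint.h>

typedef uint32_t Oid;

/*
 * Independently schedulable slices of a single VACUUM. Any available
 * worker could pick up a work item; the one hard dependency is that
 * heap vacuuming only runs once every index has been vacuumed for the
 * current batch of dead TIDs.
 */
typedef enum VacuumPhase
{
    VACPHASE_SCAN_HEAP,         /* prune/freeze, collect dead TIDs */
    VACPHASE_VACUUM_INDEXES,    /* one work item per index */
    VACPHASE_VACUUM_HEAP,       /* only after all index items finish */
    VACPHASE_TRUNCATE,          /* optional rel truncation at the end */
    VACPHASE_DONE
} VacuumPhase;

typedef struct VacuumWorkItem
{
    Oid         relid;
    VacuumPhase phase;
    char        dead_items_path[64];    /* temp file of serialized TIDs */
    int         indexes_remaining;      /* gates the VACUUM_HEAP step */
} VacuumWorkItem;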

As for this patch of mine: do you think that it would be acceptable to
pursue a version based on your autovacuum_no_auto_cancel_age design
for 16? Perhaps this can include something like
pgstat_report_autovac_failure(). It's not even the work of
implementing pgstat_report_autovac_failure() that creates the risk of
missing the 16 feature freeze deadline. I'm more concerned that
introducing a more complicated design will lead to the patch being
bikeshedded to death.

--
Peter Geoghegan
