Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date: 2023-01-17 04:11:05
Message-ID: CAH2-Wznsg-fp1vJR9_qLe6sRWqVVBiQsqt20xKwNBFqdLif84g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> That looks completely magical from a user perspective. Some users
> aren't going to understand autovacuum behavior at all. Some will, and
> will be able to compare age(relfrozenxid) against
> autovacuum_freeze_max_age. Very few people are going to think to
> compare age(relfrozenxid) against some formula based on
> autovacuum_freeze_max_age. I guess if we document it, maybe they will.

What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

As I've said several times already, I am by no means attached to the
current formula.
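
To make the formula under discussion concrete, here is a minimal sketch.
Only the expression Min(freeze_max_age * 2, 1 billion) comes from the thread;
the function names are hypothetical, and the strict "<" comparison is an
assumption made for illustration, not a description of what any patch does.

```python
# Illustration only: function names are hypothetical, not PostgreSQL APIs.
AUTOVACUUM_FREEZE_MAX_AGE_DEFAULT = 200_000_000  # PostgreSQL's default setting
ONE_BILLION = 1_000_000_000


def no_auto_cancel_age(freeze_max_age: int = AUTOVACUUM_FREEZE_MAX_AGE_DEFAULT) -> int:
    """XID age past which autovacuum would stop auto-cancelling (formula from the thread)."""
    return min(freeze_max_age * 2, ONE_BILLION)


def auto_cancel_allowed(relfrozenxid_age: int,
                        freeze_max_age: int = AUTOVACUUM_FREEZE_MAX_AGE_DEFAULT) -> bool:
    """Assumed comparison: auto-cancel stays enabled strictly below the cutoff."""
    return relfrozenxid_age < no_auto_cancel_age(freeze_max_age)


if __name__ == "__main__":
    print(no_auto_cancel_age())               # cutoff with the default setting
    print(no_auto_cancel_age(800_000_000))    # cutoff is capped at one billion
    print(auto_cancel_allowed(250_000_000))   # past freeze_max_age, below cutoff
```

With the default setting, a table whose age(relfrozenxid) is between
autovacuum_freeze_max_age and the cutoff would still have its autovacuum
auto-cancelled under this sketch, which is the part a user could not infer
from autovacuum_freeze_max_age alone.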

> I do like the idea of driving the auto-cancel behavior off of the
> results of previous attempts to vacuum the table. That could be done
> independently of the XID age of the table.

Even when the XID age of the table has already significantly surpassed
autovacuum_freeze_max_age, say due to autovacuum worker starvation?

> If we've failed to vacuum
> the table, say, 10 times, because we kept auto-cancelling, it's
> probably appropriate to force the issue.

I suggested 1000 times upthread. 10 times seems very low, at least if
"number of times cancelled" is the sole criterion, without any
attention paid to relfrozenxid age or some other tiebreaker.
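
A rough sketch of what "cancel count plus a tiebreaker" could look like
follows. Everything here is hypothetical: the function name, the two limits,
and the way they are combined are assumptions chosen to illustrate the idea,
not a proposal that anybody on the thread has made in this form.

```python
# Hypothetical policy sketch; none of these names or limits exist in PostgreSQL.
def should_stop_auto_cancel(times_cancelled: int,
                            relfrozenxid_age: int,
                            freeze_max_age: int = 200_000_000,
                            high_cancel_limit: int = 1000,
                            low_cancel_limit: int = 10) -> bool:
    """Decide whether autovacuum should stop yielding to conflicting lock requests."""
    # Cancel count as the sole criterion: only a very high count forces the issue.
    if times_cancelled >= high_cancel_limit:
        return True
    # Tiebreaker: a much lower count is enough once the table's XID age
    # has already passed freeze_max_age.
    return times_cancelled >= low_cancel_limit and relfrozenxid_age > freeze_max_age


if __name__ == "__main__":
    print(should_stop_auto_cancel(10, 50_000_000))    # young table, few cancels
    print(should_stop_auto_cancel(10, 250_000_000))   # old table, few cancels
    print(should_stop_auto_cancel(1000, 50_000_000))  # young table, many cancels
```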

> It doesn't really matter
> whether the autovacuum triggered because of bloat or because of XID
> age. Letting either of those things get out of control is bad.

While inventing a new no-auto-cancel behavior that prevents bloat from
getting completely out of control may well have merit, I don't see why
it needs to be attached to this other effort.

I think that the vast majority of individual tables have autovacuums
cancelled approximately never, and so my immediate concern is
ameliorating cases where not being able to auto-cancel once in a blue
moon causes an outage. Sure, the opposite problem also exists, and I
think that it would be really bad if it was made significantly worse
as an unintended consequence of a patch that addressed just the first
problem. But that doesn't mean we have to solve both problems together
at the same time.

> But at that point a lot of harm has already
> been done. In a frequently updated table, waiting 300 million XIDs to
> stop cancelling the vacuum is basically condemning the user to have to
> run VACUUM FULL. The table can easily be ten or a hundred times bigger
> than it should be by that point.

The rate at which relfrozenxid ages is just about useless as a proxy
for how much wall clock time has passed with a given workload --
workloads are usually very bursty. It's much worse still as a proxy
for what has changed in the table; completely static tables have their
relfrozenxid age at exactly the same rate as the most frequently
updated table in the same database (the table that "consumes the most
XIDs"). So while antiwraparound autovacuum no-auto-cancel behavior may
indeed save the user from problems with serious bloat, it will happen
pretty much by mistake. Not that it doesn't happen all the same -- of
course it does.
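
The point about static tables can be shown with a toy model. This is a
simplification that treats XIDs as plain increasing integers and ignores
wraparound arithmetic; the table names and starting values are made up.

```python
# Toy model: XIDs as plain integers, no wraparound. Names and values are made up.
def xid_age(next_xid: int, relfrozenxid: int) -> int:
    """Simplified age(relfrozenxid): distance from the table's relfrozenxid to the next XID."""
    return next_xid - relfrozenxid


# Both tables were last frozen at the same point.
relfrozenxid = {"static_table": 1000, "hot_table": 1000}
next_xid = 1000

# Every XID below is consumed by a transaction that only updates hot_table.
for _ in range(5000):
    next_xid += 1

ages = {name: xid_age(next_xid, frozen) for name, frozen in relfrozenxid.items()}
print(ages)
```

Both tables end up with the same age even though only one of them was
modified, because the age depends on XIDs consumed database-wide rather than
on anything that happened to the table itself.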

That factor (the mistake factor) doesn't mean I take the point any
less seriously. What I don't take seriously is the idea that the
precise XID age was ever crucially important.

More generally, I just don't accept that this leaves us with no room for
something along the lines of my proposal, such as Andres'
autovacuum_no_auto_cancel_age concept. As I've said already, there will
usually be a very asymmetric quality to the problem in cases like the
Joyent outage. Even a modest amount of additional XID-space-headroom
will very likely be all that will be needed at the critical juncture.
It may not be perfect, but it still has every potential to make things
safer for some users, without making things any less safe for other
users.

--
Peter Geoghegan
