Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date: 2023-01-13 21:59:56
Message-ID: 20230113215956.nx2orfnlcixjyqia@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-01-12 16:08:06 -0500, Robert Haas wrote:
> Normally, the XID age of a table never reaches autovacuum_freeze_max_age in
> the first place.

That's not at all my experience. I often see it being the primary reason for
autovacuum vacuuming large tables on busy OLTP systems. Even without any
longrunning transactions or such, with available autovac workers and without
earlier autovacuums getting interrupted by locks. Once a table is large,
reasonable scale factors require a lot of changes to accumulate to trigger an
autovacuum, and during a vacuum a lot of transactions complete, leading to
large tables having a significant age by the time autovac finishes.

The most common "bad" reason for reaching autovacuum_freeze_max_age that I see
is cost limits not allowing vacuum to complete on time.

Perhaps we should track how often autovacuum was triggered by what reason in
a relation's pgstats? That'd make it a lot easier to collect data, both for
tuning the thresholds on a system, and for hacking on postgres.

Tracking the number of times autovacuum was interruped due to a lock request
might be a good idea as well?

I think it'd be a good idea to split off the part of the patch that introduces
AutoVacType / adds logging for what triggered. That's independently useful,
likely uncontroversial and makes the remaining patch smaller.

I'd also add the trigger to the pg_stat_activity entry for the autovac
worker. Additionally I think we should add information about using failsafe
mode to the p_s_a entry.

I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
times, FWIW. And I've seen plenty of places write their own userspace version
of it, because without it they run into trouble. However, I don't like that
the patch infers the various thresholds using magic constants / multipliers.

autovacuum_freeze_max_age is really a fairly random collection of things:
1) triggers autovacuum on tables based on age, in addition to the dead tuple /
inserted tuples triggers
2) prevents auto-cancellation of autovacuum
3) starts autovacuum, even if autovacuum is disabled

IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2)
to make progress is a bit more concerning. 3) should rarely be needed, but is
a good safety mechanism.

I doubt that controlling all of them via one GUC is sensible.

If I understand the patch correctly, we now have the following age based
thresholds for av:

- force-enable autovacuum:
oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
- autovacuum based on age:
freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
tableagevac = relfrozenxid < recentXid - freeze_max_age
- prevent auto-cancellation:
freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age

Is that right?

One thing I just noticed: Isn't it completely bonkers that we compute
recentXid/recentMulti once at the start of a worker in
relation_needs_vacanalyze()? That's fine for the calls in do_autovacuum()'s
initial loops over all tables. But seems completely wrong for the later calls
via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() ->
relation_needs_vacanalyze()?

These variables really shouldn't be globals. It makes sense to cache them
locally in do_autovacuum(), but reusing them
recheck_relation_needs_vacanalyze() and sharing it between launcher and worker
is bad.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2023-01-13 22:20:41 Fixes required for cross version update testing
Previous Message Jeff Davis 2023-01-13 21:30:28 Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX