Autovacuum vs vac_update_datfrozenxid() vs ?

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, michael(at)paquier(dot)xyz, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Autovacuum vs vac_update_datfrozenxid() vs ?
Date: 2020-03-23 23:50:36
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


While looking to understand what could be going on with [1], I think I
might have stumbled on a few issues that could at least explain parts of
the problem.

First, it seems to me that vac_update_datfrozenxid() can spuriously
(but temporarily) fail due to the 'bogus'
logic. vac_update_datfrozenxid() first determines the 'newest' xid it
* Identify the latest relfrozenxid and relminmxid values that we could
* validly see during the scan. These are conservative values, but it's
* not really worth trying to be more exact.
lastSaneFrozenXid = ReadNewTransactionId();
lastSaneMinMulti = ReadNextMultiXactId();

and then starts a full table scan of pg_class to determine the database
wide relfrozenxid:
scan = systable_beginscan(relation, InvalidOid, false,

Whenever vac_update_datfrozenxid() encounters a relfrozenxid that's "too
new", it'll bail out:
* If things are working properly, no relation should have a
* relfrozenxid or relminmxid that is "in the future". However, such
* cases have been known to arise due to bugs in pg_upgrade. If we
* see any entries that are "in the future", chicken out and don't do
* anything. This ensures we won't truncate clog before those
* relations have been scanned and cleaned up.
if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
bogus = true;

Which all in theory makes sense - but there's two fairly sizable races
1) Because lastSaneFrozenXid is determined *before* the snapshot for
the pg_class, it's entirely possible that concurrently another vacuum on
another table in the same database starts & finishes. And that very well
can end up using a relfrozenxid that's newer than the current
2) Much worse, because vac_update_relstats() uses heap_inplace_update(),
there's actually no snapshot semantics here anyway! Which means that
the window for a race isn't just between the ReadNewTransactionId()
and the systable_beginscan(), but instead lasts for the duration of
the entire scan.

Either of these can then triggers vac_update_datfrozenxid() to
*silently* bail out without touching anything.

I think there's also another (even larger?) race in
vac_update_datfrozenxid(): Unless I miss something, two backends can
concurrently run through the scan in vac_update_datfrozenxid() for two
different tables in the same database, both can check that they need to
update the pg_database row, and then one of them can overwrite the
results of the other. And the one that updates last might actually be
the one with an older horizon. This is possible since there is no 'per
database' locking in heap_vacuum_rel().

The way I suspect that interacts with the issue in [1] ist that once
that has happened, do_start_worker() figures out it's doing a xid
wraparound vacuum based on the "wrong" datfrozenxid:
/* Check to see if this one is at risk of wraparound */
if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
if (avdb == NULL ||
avdb = tmp;
for_xid_wrap = true;
else if (for_xid_wrap)
continue; /* ignore not-at-risk DBs */

which means autovac launcher will only start workers for that database -
but there might actually not be any work for it.

I have some theories as to why this is more common in 12, but I've not
fully nailed them down.


Andres Freund



Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2020-03-24 00:50:55 Re: Define variables in the approprieate scope
Previous Message Alvaro Herrera 2020-03-23 23:32:27 Re: range_agg