What is "wraparound failure", really?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: What is "wraparound failure", really?
Date: 2021-06-27 20:36:19
Message-ID: CAH2-Wzk_FxfJvs4TnUtj=DCsokbiK0CxfjZ9jjrfSx8sTWkeUg@mail.gmail.com
Lists: pgsql-hackers

The wraparound failsafe mechanism added by commit 1e55e7d1 had minimal
documentation -- just a basic description of how the GUCs work. I
think that it certainly merits some discussion under "25.1. Routine
Vacuuming" -- more specifically under "25.1.5. Preventing Transaction
ID Wraparound Failures". One reason why this didn't happen in the
original commit was that I just didn't know where to start with it.
The docs in question have said this since 2006's commit 48188e16 first
added autovacuum_freeze_max_age:

"The sole disadvantage of increasing autovacuum_freeze_max_age (and
vacuum_freeze_table_age along with it) is that the pg_xact and
pg_commit_ts subdirectories of the database cluster will take more
space..."

This sentence seems completely unreasonable to me. It seems to just
ignore the huge disadvantage of increasing autovacuum_freeze_max_age:
the *risk* that the system will stop being able to allocate new XIDs
because GetNewTransactionId() errors out with "database is not
accepting commands to avoid wraparound data loss...". Sure, it's
possible to take a lot of risk here without it ever blowing up in your
face, and if it doesn't blow up then the downside really is zero. But
that is hardly a sensible way to talk about this important risk, or
about any risk at all.
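
To put rough numbers on that risk, here is a minimal sketch of how much
XID "headroom" remains once an anti-wraparound autovacuum has been
forced. It is a simplified model, not PostgreSQL code: the function
name, the constant names, and the size of the safety margin before the
system refuses new XIDs are all assumptions made for illustration (the
real margin is version-dependent).

```python
# Simplified model, not PostgreSQL source. All names are hypothetical.
XID_HORIZON = 2**31       # roughly half of the 32-bit XID space
STOP_MARGIN = 3_000_000   # ASSUMED safety margin before XIDs are refused


def xids_of_headroom(autovacuum_freeze_max_age: int,
                     stop_margin: int = STOP_MARGIN) -> int:
    """XIDs that can still be consumed between the point where an
    anti-wraparound autovacuum is forced and the point where the
    system stops allocating new XIDs (under this simplified model)."""
    return XID_HORIZON - stop_margin - autovacuum_freeze_max_age


# 200 million is the default setting; 2 billion is the maximum.
for setting in (200_000_000, 1_000_000_000, 2_000_000_000):
    headroom = xids_of_headroom(setting)
    print(f"{setting:>13,} -> {headroom:>13,} XIDs of headroom")
```

Under these assumptions, raising the setting from the default to the
maximum shrinks the time available for VACUUM to finish by more than
an order of magnitude, which is exactly the kind of disadvantage the
quoted sentence leaves out.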

At first I thought that the sentence was not just misguided -- it
seemed downright bizarre. I thought that it was directly at odds with
the title "Preventing Transaction ID Wraparound Failures". I thought
that the whole point of this section was how not to have a wraparound
failure (as I understand the term), and yet we seem to deliberately
ignore the single most important practical aspect of making sure that
that doesn't happen. But I now suspect that the basic definitions have
been mixed up in a subtle but important way.

What the documentation calls a "wraparound failure" seems to be rather
different from what I thought the term meant. As I said, I took it to
mean the condition of being unable to get new transaction IDs
(at least until the DBA runs VACUUM in single user mode). But the
documentation in question seems to actually define it as "the
condition of an old MVCC snapshot failing to see a version from the
distant past, because somehow an XID wraparound suddenly makes it look
as if it's in the distant future rather than in the past". It's
actually talking about a subtly different thing, so the "sole
disadvantage" sentence is not actually bizarre. It does still seem
impractical and confusing, though.
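
To illustrate the documentation's sense of the term, here is a small
sketch of circular XID comparison using a signed 32-bit difference. It
is modeled on how normal XIDs are compared, but it is only an
illustration: the function name is hypothetical, and the special
(non-normal) XIDs are deliberately ignored.

```python
def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a looks "in the past" relative to XID b.

    XIDs live on a 32-bit circle, so the comparison uses the signed
    32-bit difference rather than plain integer ordering.
    Illustration only: special XIDs are not handled.
    """
    diff = (a - b) & 0xFFFFFFFF      # wrap the difference to 32 bits
    if diff >= 0x80000000:           # reinterpret as signed int32
        diff -= 0x100000000
    return diff < 0


old_xid = 1000

# Fewer than 2^31 XIDs later: the old XID is still seen as the past.
print(xid_precedes(old_xid, old_xid + 2**31 - 1))

# More than 2^31 XIDs later: the same old XID now compares as the future.
print(xid_precedes(old_xid, (old_xid + 2**31 + 1) & 0xFFFFFFFF))
```

The second comparison is the "wraparound failure" that the documentation
appears to have in mind: an unfrozen tuple whose XID is far enough in
the past suddenly compares as being in the future. Refusing to allocate
new XIDs is the mechanism that prevents the system from ever reaching
that state.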

I strongly suspect that my interpretation of what "wraparound failure"
means is actually the common one. Of course the system is never under
any circumstances allowed to give totally wrong answers to queries, no
matter what -- users should be able to take that much for granted.
What users care about here is sensibly managing XIDs as a resource --
preventing "XID exhaustion" while being conservative, but not
ridiculously conservative. Could the documentation be completely
misleading users here?

I have two questions:

1. Do I have this right? Is there really confusion about what a
"wraparound failure" means, or is the confusion mine alone?

2. How do I go about integrating discussion of the failsafe here?
Anybody have thoughts on that?

--
Peter Geoghegan
