Re: What is "wraparound failure", really?

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: What is "wraparound failure", really?
Date: 2021-06-28 08:44:44
Message-ID: CAD21AoB_3+fFd3FA6svQ1Q-a0qCHeGd1yhJOp5odom3Vp9Jyuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 28, 2021 at 5:36 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> The wraparound failsafe mechanism added by commit 1e55e7d1 had minimal
> documentation -- just a basic description of how the GUCs work. I
> think that it certainly merits some discussion under "25.1. Routine
> Vacuuming" -- more specifically under "25.1.5. Preventing Transaction
> ID Wraparound Failures". One reason why this didn't happen in the
> original commit was that I just didn't know where to start with it.
> The docs in question have said this since 2006's commit 48188e16 first
> added autovacuum_freeze_max_age:
>
> "The sole disadvantage of increasing autovacuum_freeze_max_age (and
> vacuum_freeze_table_age along with it) is that the pg_xact and
> pg_commit_ts subdirectories of the database cluster will take more
> space..."
>
> This sentence seems completely unreasonable to me. It seems to just
> ignore the huge disadvantage of increasing autovacuum_freeze_max_age:
> the *risk* that the system will stop being able to allocate new XIDs
> because GetNewTransactionId() errors out with "database is not
> accepting commands to avoid wraparound data loss...". Sure, it's
> possible to take a lot of risk here without it ever blowing up in your
> face. And if it doesn't blow up then the downside really is zero. This
> is hardly a sensible way to talk about this important risk. Or any
> risk at all.
>
> At first I thought that the sentence was not just misguided -- it
> seemed downright bizarre. I thought that it was directly at odds with
> the title "Preventing Transaction ID Wraparound Failures". I thought
> that the whole point of this section was how not to have a wraparound
> failure (as I understand the term), and yet we seem to deliberately
> ignore the single most important practical aspect of making sure that
> that doesn't happen. But I now suspect that the basic definitions have
> been mixed up in a subtle but important way.
>
> What the documentation calls a "wraparound failure" seems to be rather
> different to what I thought that that meant. As I said, I thought that
> that meant the condition of being unable to get new transaction IDs
> (at least until the DBA runs VACUUM in single user mode). But the
> documentation in question seems to actually define it as "the
> condition of an old MVCC snapshot failing to see a version from the
> distant past, because somehow an XID wraparound suddenly makes it look
> as if it's in the distant future rather than in the past". It's
> actually talking about a subtly different thing, so the "sole
> disadvantage" sentence is not actually bizarre. It does still seem
> impractical and confusing, though.
>
> I strongly suspect that my interpretation of what "wraparound failure"
> means is actually the common one. Of course the system is never under
> any circumstances allowed to give totally wrong answers to queries, no
> matter what -- users should be able to take that much for granted.
> What users care about here is sensibly managing XIDs as a resource --
> preventing "XID exhaustion" while being conservative, but not
> ridiculously conservative. Could the documentation be completely
> misleading users here?
>
> I have two questions:
>
> 1. Do I have this right? Is there really confusion about what a
> "wraparound failure" means, or is the confusion mine alone?
>
> 2. How do I go about integrating discussion of the failsafe here?
> Anybody have thoughts on that?

Looking through the doc again, it seems to me that there is no
explicit explanation for the worst situation. It might be true in
principle that “XID wraparound failure” means catastrophic data loss
due to XID wraparound. But it doesn’t actually happen since we
disallow to allocate new XID three million XID before the wraparound.
In other words, entering the read-only mode is the worst situation in
PostgreSQL in terms of XID consumption. There is some description of
refusing to start any new transactions at the end of section 25.1.5
but it seems neither enough nor accurate. It describes the read-only
mode from only the aspect of a safeguard but not from the aspect of
the situation where we want to avoid. Explicitly describing also the
latter aspect could give weight to both the description of failsafe
mode, especially why we skip some operations to speed up increasing
relfrozenxid in that mode, and another disadvantage of increasing
autovacuum_freeze_max_age.
Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2021-06-28 09:16:22 Numeric multiplication overflow errors
Previous Message Amit Kapila 2021-06-28 08:39:44 Re: Fix for segfault in logical replication on master