pg_upgrade can result in early wraparound on databases with high transaction load

From: Jason Harvey <jason(at)reddit(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: pg_upgrade can result in early wraparound on databases with high transaction load
Date: 2019-05-20 10:10:17
Message-ID: CALSof1GM6i21BLr8PsFiRYazakojvesdc+_MiR-L_V5NNkRuWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Hello,

This week I upgraded one of my large(2.8TB), high-volume databases from 9
to 11. The upgrade itself went fine. About two days later, we unexpectedly
hit transaction ID wraparound. What was perplexing about this was that the
age of our oldest `datfrozenxid` was only 1.2 billion - far away from where
I'd expect a wraparound. Curiously, the wraparound error referred to a
mysterious database of `OID 0`:

UPDATE ERROR: database is not accepting commands to avoid wraparound data
loss in database with OID 0

We were able to recover after a few hours by greatly speeding up our vacuum
on our largest table.

In a followup investigation I uncovered the reason we hit the wraparound so
early, and also the cause of the mysterious OID 0 message. When pg_upgrade
executes, it calls pg_resetwal to set the next transaction ID. Within
pg_resetwal is the following code:
https://github.com/postgres/postgres/blob/6cd404b344f7e27f4d64555bb133f18a758fe851/src/bin/pg_resetwal/pg_resetwal.c#L440-L450

This sets the controldata to have a fake database (OID 0) on the brink of
transaction wraparound. Specifically, after pg_upgrade is ran, wraparound
will occur within around 140 million transactions (provided the autovacuum
doesn't finish first). I confirmed by analyzing our controldata before and
after the upgrade that this was the cause of our early wraparound.

Given the size and heavy volume of our database, we tend to complete a
vacuum in the time it takes around 250 million transactions to execute.
With our tunings this tends to be rather safe and we stay well away from
the wraparound point under normal circumstances.

Unfortunately we had no obvious way of knowing that the upgrade would place
our database upon the brink of wraparound. In fact, since this info is only
persisted in the controldata, the only way to discover this state to my
knowledge would be to inspect the controldata itself. Other standard means
of monitoring for wraparound risk involve watching `pg_database` or
`pg_class`, which in this case tells us nothing helpful since the fake
database present in the controldata is not represented in those stats.

I'd like to suggest that either the pg_upgrade->pg_resetwal behaviour be
adjusted, or the pg_upgrade documentation highlight this potential
scenario. I'm happy to contribute code and/or documentation pull requests
to accomplish this.

Thank you,
Jason Harvey
reddit.com

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Francisco Olarte 2019-05-20 10:18:12 Re: Bug in documentation (trim(...))?
Previous Message Johann Spies 2019-05-20 10:09:32 Bug in documentation (trim(...))?

Browse pgsql-hackers by date

  From Date Subject
Next Message Matwey V. Kornilov 2019-05-20 11:32:39 [PATCH v2] Introduce spgist quadtree @<(point,circle) operator
Previous Message Kyotaro HORIGUCHI 2019-05-20 07:59:05 Re: Statistical aggregate functions are not working with PARTIAL aggregation