Re: pg_upgrade can result in early wraparound on databases with high transaction load

From: Noah Misch <noah(at)leadboat(dot)com>
To: Jason Harvey <jason(at)reddit(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_upgrade can result in early wraparound on databases with high transaction load
Date: 2019-06-15 18:37:59
Message-ID: 20190615183759.GB239428@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Tue, May 21, 2019 at 03:23:00PM -0700, Peter Geoghegan wrote:
> On Mon, May 20, 2019 at 3:10 AM Jason Harvey <jason(at)reddit(dot)com> wrote:
> > This week I upgraded one of my large(2.8TB), high-volume databases from 9 to 11. The upgrade itself went fine. About two days later, we unexpectedly hit transaction ID wraparound. What was perplexing about this was that the age of our oldest `datfrozenxid` was only 1.2 billion - far away from where I'd expect a wraparound. Curiously, the wraparound error referred to a mysterious database of `OID 0`:
> >
> > UPDATE ERROR: database is not accepting commands to avoid wraparound data loss in database with OID 0

That's bad.

> > We were able to recover after a few hours by greatly speeding up our vacuum on our largest table.

For what it's worth, a quicker workaround is to VACUUM FREEZE any database,
however small. That forces a vac_truncate_clog(), which recomputes the wrap
point from pg_database.datfrozenxid values. This demonstrates the workaround:

--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -248,7 +248,10 @@ case $testhost in
esac

pg_dumpall --no-sync -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+pg_controldata "${PGDATA}"
+vacuumdb -F template1
pg_ctl -m fast stop
+pg_controldata "${PGDATA}"

if [ -n "$pg_dumpall2_status" ]; then
echo "pg_dumpall of post-upgrade database cluster failed"

> > In a followup investigation I uncovered the reason we hit the wraparound so early, and also the cause of the mysterious OID 0 message. When pg_upgrade executes, it calls pg_resetwal to set the next transaction ID. Within pg_resetwal is the following code: https://github.com/postgres/postgres/blob/6cd404b344f7e27f4d64555bb133f18a758fe851/src/bin/pg_resetwal/pg_resetwal.c#L440-L450

pg_upgrade should set oldestXID to the same value as the source cluster or set
it like vac_truncate_clog() would set it. Today's scheme is usually too
pessimistic, but it can be too optimistic if the source cluster was on the
bring of wrap. Thanks for the report.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter Geoghegan 2019-06-16 00:35:52 Re: checkpoints taking much longer than expected
Previous Message Pavel Stehule 2019-06-15 14:37:19 Re: how to concat/concat_ws all fields without braces

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2019-06-15 18:47:39 Re: Do we expect tests to work with default_transaction_isolation=serializable
Previous Message Noah Misch 2019-06-15 18:07:32 Re: Dead encoding conversion functions