Re: pg_upgrade failing for 200+ million Large Objects

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kumar, Sachin" <ssetiya(at)amazon(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jan Wieck <jan(at)wi3ck(dot)info>, Bruce Momjian <bruce(at)momjian(dot)us>, Zhihong Yu <zyu(at)yugabyte(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Robins Tharakan <tharakan(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_upgrade failing for 200+ million Large Objects
Date: 2023-12-11 01:42:42
Message-ID: 2055911.1702258962@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I spent some time looking at the v7 patch. I can't help feeling
that this is going off in the wrong direction, primarily for
these reasons:

* It focuses only on cutting the number of transactions needed
to restore a large number of blobs (large objects). Certainly
that's a pain point, but it's not the only one of this sort.
If you have a lot of tables, restore will consume just as many
transactions as it would for a similar number of blobs --- probably
more, in fact, since we usually need more commands per table than
per blob.

* I'm not too thrilled with the (undocumented) rearrangements in
pg_dump. I really don't like the idea of emitting a fundamentally
different TOC layout in binary-upgrade mode; that seems unmaintainably
bug-prone. Plus, the XID-consumption problem is not really confined
to pg_upgrade.

What I think we actually ought to do is one of the alternatives
discussed upthread: teach pg_restore to be able to commit
every so often, without trying to provide the all-or-nothing
guarantees of --single-transaction mode. This cuts its XID
consumption by whatever multiple "every so often" is, while
allowing us to limit the number of locks taken during any one
transaction. It also seems a great deal safer than the idea
I floated of not taking locks at all during a binary upgrade;
plus, it has some usefulness with regular pg_restore that's not
under control of pg_upgrade.

So I had a go at coding that, and attached is the result.
It invents a --transaction-size option, and when that's active
it will COMMIT after every N TOC items. (This seems simpler to
implement and less bug-prone than every-N-SQL-commands.)

I had initially supposed that in a parallel restore we could
have child workers also commit after every N TOC items, but was
soon disabused of that idea. After a worker processes a TOC
item, any dependent items (such as index builds) might get
dispatched to some other worker, which had better be able to
see the results of the first worker's step. So at least in
this implementation, we disable the multi-command-per-COMMIT
behavior during the parallel part of the restore. Maybe that
could be improved in future, but it seems like it'd add a
lot more complexity, and it wouldn't make life any better for
pg_upgrade (which doesn't use parallel pg_restore, and seems
unlikely to want to in future).

I've not spent a lot of effort on pg_upgrade changes here:
I just hard-wired it to select --transaction-size=1000.
Given the default lock table size of 64*100, that gives us
enough headroom for each TOC to take half a dozen locks.
We could go higher than that by making pg_upgrade force the
destination postmaster to create a larger-than-default lock
table, but I'm not sure if it's worth any trouble. We've
already bought three orders of magnitude improvement as it
stands, which seems like enough ambition for today. (Also,
having pg_upgrade override the user's settings in the
destination cluster might not be without downsides.)

Another thing I'm wondering about is why this is only a pg_restore
option not also a pg_dump/pg_dumpall option. I did it like that
because --single-transaction is pg_restore only, but that seems more
like an oversight or laziness than a well-considered decision.
Maybe we should back-fill that omission; but it could be done later.

Thoughts?

regards, tom lane

Attachment Content-Type Size
v8-0001-restore-transaction-size-option.patch text/x-diff 12.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2023-12-11 01:57:15 Re: Make COPY format extendable: Extract COPY TO format implementations
Previous Message Thomas Munro 2023-12-11 00:57:39 Some useless includes in llvmjit_inline.cpp