Re: pg_upgrade failing for 200+ million Large Objects

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kumar, Sachin" <ssetiya(at)amazon(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jan Wieck <jan(at)wi3ck(dot)info>, Bruce Momjian <bruce(at)momjian(dot)us>, Zhihong Yu <zyu(at)yugabyte(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Robins Tharakan <tharakan(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_upgrade failing for 200+ million Large Objects
Date: 2023-12-20 23:47:44
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I have spent some more effort in this area and developed a patch
series that I think addresses all of the performance issues that
we've discussed in this thread, both for pg_upgrade and more
general use of pg_dump/pg_restore. Concretely, it absorbs
the pg_restore --transaction-size switch that I proposed before
to cut the number of transactions needed during restore, and
rearranges the representation of BLOB-related TOC entries to
reduce the client-side memory requirements, and fixes some
ancient mistakes that prevent both selective restore of BLOBs
and parallel restore of BLOBs.

As a demonstration, I made a database containing 100K empty blobs,
and measured the time needed to dump/restore that using -Fd
and -j 10. HEAD doesn't get any useful parallelism on blobs,
but with this patch series we do:

dump restore
HEAD: 14sec 15sec
after 0002: 7sec 10sec
after 0003: 7sec 3sec

There are a few loose ends:

* I did not invent a switch to control the batching of blobs; it's
just hard-wired at 1000 blobs per group here. Probably we need some
user knob for that, but I'm unsure if we want to expose a count or
just a boolean for one vs more than one blob per batch. The point of
forcing one blob per batch would be to allow exact control during
selective restore, and I'm not sure if there's any value in random
other settings. On the other hand, selective restore of blobs has
been completely broken for the last dozen years and I can't recall any
user complaints about that; so maybe nobody cares and we could just
leave this as an internal choice.

* Likewise, there's no user-accessible knob to control what
transaction size pg_upgrade uses. Do we need one? In any case, it's
likely that the default needs a bit more thought than I've given it.
I used 1000, but if pg_upgrade is launching parallel restore jobs we
likely need to divide that by the number of restore jobs.

* As the patch stands, we still build a separate TOC entry for each
comment or seclabel or ACL attached to a blob. If you have a lot of
blobs with non-default properties then the TOC bloat problem comes
back again. We could do something about that, but it would take a bit
of tedious refactoring, and the most obvious way to handle it probably
re-introduces too-many-locks problems. Is this a scenario that's
worth spending a lot of time on?

More details appear in the commit messages below. Patch 0004
is nearly the same as the v8 patch I posted before, although
it adds some logic to ensure that a large blob metadata batch
doesn't create too many locks.


regards, tom lane

PS: I don't see any active CF entry for this thread, so
I'm going to go make one.

Attachment Content-Type Size
v9-0001-Some-small-preliminaries-for-pg_dump-changes.patch text/x-diff 5.9 KB
v9-0002-In-dumps-group-large-objects-into-matching-metada.patch text/x-diff 39.3 KB
v9-0003-Move-BLOBS-METADATA-TOC-entries-into-SECTION_DATA.patch text/x-diff 3.0 KB
v9-0004-Invent-transaction-size-option-for-pg_restore.patch text/x-diff 14.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Jeremy Schneider 2023-12-20 23:47:51 Re: Built-in CTYPE provider
Previous Message Michael Paquier 2023-12-20 23:44:33 Re: ci: Build standalone INSTALL file