Re: pg15b2: large objects lost on upgrade

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Shruthi Gowda <gowdashru(at)gmail(dot)com>
Subject: Re: pg15b2: large objects lost on upgrade
Date: 2022-07-08 15:53:36
Message-ID: 20220708155336.GI13040@telsasoft.com
Lists: pgsql-hackers

On Thu, Jul 07, 2022 at 03:11:38PM -0400, Robert Haas wrote:
> point: we assume that nothing significant has happened between when
> the cluster was created and when pg_upgrade is run, but we don't check
> it. Either we shouldn't assume it, or we should check it.
>
> So, is such activity ever legitimate? I think there are people doing
> it. The motivation is that maybe you have a dump from the old database
> that doesn't quite restore on the new version, but by doing something
> to the new cluster, you can make it restore. For instance, maybe there
> are some functions that used to be part of core and are now only
> available in an extension. That's going to make pg_upgrade's
> dump-and-restore workflow fail, but if you install that extension onto
> the new cluster, perhaps you can work around the problem. It doesn't
> have to be an extension, even. Maybe some function in core just got an
> extra argument, and you're using it, so the calls to that function
> cause dump-and-restore to fail. You might try overloading it in the
> new database with the old calling sequence to get things working.

I don't think that's even possible.

pg_upgrade drops template1 and postgres before upgrading:

* template1 database will already exist in the target installation,
* so tell pg_restore to drop and recreate it; otherwise we would fail
* to propagate its database-level properties.

* postgres database will already exist in the target installation, so
* tell pg_restore to drop and recreate it; otherwise we would fail to
* propagate its database-level properties.

For any other DBs, you'd hit an error if, after initdb'ing, you started the new
cluster, connected to it, created a DB (?!) and then tried to upgrade:

pg_restore: error: could not execute query: ERROR: database "pryzbyj" already exists

So if people start, connect, and then futz with a cluster before upgrading it,
it must be for global stuff (roles, tablespaces), and not per-DB stuff.
Also, pg_upgrade refuses to run if additional roles are defined...
So I'm not seeing what someone could do on the new cluster.

That supports the idea that it'd be okay to refuse to upgrade anything other
than a pristine cluster.

> Now, are these kinds of things considered to be supported? Well, I
> don't know that we've made any statement about that one way or the
> other. Perhaps they are not. But I can see why people want to use
> workarounds like this. The alternative is having to dump-and-restore
> instead of an in-place upgrade, and that's painfully slow by
> comparison.

The alternative in cases that I know about is to fix the old DB to allow it to
be upgraded. check.c has a list of the things that aren't upgradable, and the
fixes are things like ALTER TABLE ... SET WITHOUT OIDS. We just added another
one to handle v14 aggregates (09878cdd4).

> My view on this is that, while we probably don't want to make such
> things officially supported, I don't think we should ban it outright,
> either. We probably can't support an upgrade after the next cluster
> has been subjected to arbitrary amounts of tinkering, but we're making
> a mistake if we write code that has fragile assumptions for no really
> good reason. I think we can do better than this excerpt from your
> patch, for example:
>
> + /* Keep track of whether a filenode matches the OID */
> + if (maps[mapnum].relfilenumber == LargeObjectRelationId)
> + *has_lotable = true;
> + if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
> + *has_loindex = true;
>
> I spent a while struggling to understand this because it seems to me
> that every database has an LO table and an LO index, so what's up with
> these names? I think what these names are really tracking is whether
> the relfilenumber of pg_largeobject and its index in the old database
> had their default values.

Yes, has_lotable means "has an LO table whose filenode matches its OID".
I will solicit suggestions for a better name.

> But this makes the assumption that the LO
> table and LO index in the new database have never been subjected to
> VACUUM FULL or CLUSTER and, while there's no real reason to do that, I
> can't quite see what the point of such an unnecessary and fragile
> assumption might be.

The idea of running initdb, starting the cluster, and connecting to it to run
VACUUM FULL scares me. Now that I think about it, it might be almost
inconsequential, since the initial DBs are dropped, and the upgrade will fail
if any non-template DB exists. But... maybe something exciting happens if you
vacuum full a shared catalog... Yup.

With my patch:

./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b1.dat -k
./tmp_install/usr/local/pgsql/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
postgres=# \lo_import /etc/shells
postgres=# VACUUM FULL pg_largeobject;
./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b2.dat -k
./tmp_install/usr/local/pgsql/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&
postgres=# VACUUM FULL pg_database;
tmp_install/usr/local/pgsql/bin/pg_upgrade -D pg15b2.dat -b tmp_install/usr/local/pgsql/bin -d pg15b1.dat

postgres=# SELECT COUNT(1), pg_relation_filenode(oid), array_agg(relname) FROM pg_class WHERE pg_relation_filenode(oid) IS NOT NULL GROUP BY 2 HAVING COUNT(1) > 1 ORDER BY 1 DESC;
 count | pg_relation_filenode |                     array_agg
-------+----------------------+----------------------------------------------------
     2 |                16388 | {pg_toast_1262_index,pg_largeobject_loid_pn_index}

I don't have a deep understanding of why the DB hasn't imploded at this point,
maybe related to the filenode map file, but it seems very close to being
catastrophic.

It seems like pg_upgrade should at least check that the new cluster has no
objects with either OIDs or relfilenodes in the user range...
You could blame my patch, since the issue is limited to pg_largeobject.

> Perhaps a better solution to this particular problem is to remove the
> backing files for the large object table and index *before* restoring
> the dump, deciding what files to remove by asking the running server
> for the file path. It might seem funny to allow for dangling pg_class
> entries, but we're going to create that situation for all other user
> rels anyway, and pg_upgrade treats pg_largeobject as a user rel.

I'll think about it more later.

--
Justin
