Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible
Date: 2026-02-03 23:46:25
Message-ID: sc2uulu7gaajlnis7vqx25hmn6yd74drnlbzvupskx7jwyzejb@3vwodvuurtqb
Lists: pgsql-hackers

Hi,

On 2026-02-03 18:16:17 -0500, Andres Freund wrote:
> On 2025-09-08 14:20:33 -0500, Nathan Bossart wrote:
> > On Sat, Sep 06, 2025 at 10:12:11AM +0900, Michael Paquier wrote:
> > > Yep. This plan is safe to rely on.
> >
> > Committed, thanks for reviewing!
>
> I just had occasion to look at this code and I found the way this works quite
> confusing:
>
> When upgrading from a new enough server, we migrate
> pg_largeobject_metadata. So far so good. But we *also* do a COPY FROM:
> COPY "pg_catalog"."pg_largeobject_metadata" ("oid", "lomowner", "lomacl") FROM stdin;
> ...
>
> for all the large objects that have a description or a security label.
>
>
> For a while I was somewhat baffled, because that sure looks like it ought to
> lead to uniqueness violations. But it doesn't.
>
> The reason, I think, is that the COPY is happening into a relfilenode that
> will be overwritten later; it doesn't yet contain the contents of the old
> cluster.
>
> Presumably we do this because we need the temporary pg_largeobject_metadata to
> make COMMENT ON and security label commands not fail.
>
>
> If this is the reasoning / how it works, shouldn't there be a comment in the
> code or the commit message explaining that? Because it sure seems non-obvious
> to me.
>
> It's also not entirely obvious to me that this is safe - after all,
> bbe08b8869bd (revised in 0e758ae89) appears to have taken some pains to
> ensure that the file gets unlinked immediately during the "binary upgrade
> mode" TRUNCATE. But now we are actually filling that file again, after the
> relation had been truncated?

An example of what could go wrong:

Imagine that we add support for freezing pages on-access (as we are
discussing). If pg_largeobject_metadata was *not* frozen / vacuumed on the old
server, there might be no _vm fork on the old server; but due to the new
on-access freezing, one might be created for the data in the ephemeral
pg_largeobject_metadata during the accesses from COMMENT ON.

Because there's no VM on the old server, transfer_relfile() would just return
early:

    /* Is it an extent, fsm, or vm file? */
    if (type_suffix[0] != '\0' || segno != 0)
    {
        /* Did file open fail? */
        if (stat(old_file, &statbuf) != 0)
        {
            /* File does not exist?  That's OK, just return */
            if (errno == ENOENT)
                return;
therefore not reaching the:
    unlink(new_file);

Leaving a completely wrong visibility map in place.

There are other scary possibilities. Imagine that the ephemeral
pg_largeobject_metadata ends up bigger than on the old server, e.g. because we
are a bit more aggressive about bulk relation extension in the new version
(which we should be!). If that size difference ends up with a different
segment count between the old server and the ephemeral relation in the new
server, we're in trouble: transfer_relfile() wouldn't unlink the additional
segments on the target system.

Greetings,

Andres Freund
