| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Hannu Krosing <hannuk(at)google(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible |
| Date: | 2026-02-03 23:46:25 |
| Message-ID: | sc2uulu7gaajlnis7vqx25hmn6yd74drnlbzvupskx7jwyzejb@3vwodvuurtqb |
| Lists: | pgsql-hackers |
Hi,
On 2026-02-03 18:16:17 -0500, Andres Freund wrote:
> On 2025-09-08 14:20:33 -0500, Nathan Bossart wrote:
> > On Sat, Sep 06, 2025 at 10:12:11AM +0900, Michael Paquier wrote:
> > > Yep. This plan is safe to rely on.
> >
> > Committed, thanks for reviewing!
>
> I just had occasion to look at this code and I found the way this works quite
> confusing:
>
> When upgrading from a new enough server, we migrate
> pg_largeobject_metadata. So far so good. But we *also* do a COPY FROM
>
>   COPY "pg_catalog"."pg_largeobject_metadata" ("oid", "lomowner", "lomacl") FROM stdin;
>   ...
>
> for all the large objects that have a description or a security label.
>
>
> For a while I was somewhat baffled, because that sure looks like it ought to
> lead to uniqueness violations. But it doesn't.
>
> The reason, I think, is that the COPY is happening into a relfilenode that
> will be overwritten later; it doesn't yet contain the contents of the old
> cluster.
>
> Presumably we do this because we need the temporary pg_largeobject_metadata to
> make COMMENT ON and security label commands not fail.
>
>
> If this is the reasoning / how it works, shouldn't there be a comment in the
> code or the commit message explaining that? Because it sure seems non-obvious
> to me.
>
> It's also not entirely obvious to me that this is safe - after all, commit
> bbe08b8869bd (revised in 0e758ae89) appeared to take some pains to ensure
> that the file gets unlinked immediately during the "binary upgrade mode"
> TRUNCATE. But now we are actually filling that file again, after the
> relation had been truncated?
An example of what could go wrong:
Imagine that we add support for freezing pages on-access (as we are
discussing). If pg_largeobject_metadata was *not* frozen / vacuumed on the old
server, there might not be a _vm fork on the old server, but due to the new
on-access freezing, one might get created for the data in the ephemeral
pg_largeobject_metadata during the accesses from COMMENT ON.

Because there's no VM fork on the old server, transfer_relfile() would just
return early:

    /* Is it an extent, fsm, or vm file? */
    if (type_suffix[0] != '\0' || segno != 0)
    {
        /* Did file open fail? */
        if (stat(old_file, &statbuf) != 0)
        {
            /* File does not exist? That's OK, just return */
            if (errno == ENOENT)
                return;
and therefore never reaching the

    unlink(new_file);

leaving a completely wrong visibility map in place.
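
To make that concrete, here is a small, self-contained sketch. It is not
pg_upgrade code - the function and file names are made up and the error
handling is elided - it just mimics the quoted control flow to show that,
when the old cluster has no _vm, the early return leaves a vm fork created
on the new side in place:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the quoted extent/fsm/vm handling: if the old
 * cluster has no such file, return without touching the new cluster's copy.
 */
static bool
transfer_fork(const char *old_file, const char *new_file,
              const char *type_suffix, int segno)
{
    struct stat statbuf;

    /* Is it an extent, fsm, or vm file? */
    if (type_suffix[0] != '\0' || segno != 0)
    {
        if (stat(old_file, &statbuf) != 0 && errno == ENOENT)
            return false;    /* unlink(new_file) below is never reached */
    }

    unlink(new_file);
    /* ... copying/linking old_file into place would happen here ... */
    return true;
}

int
main(void)
{
    const char *old_vm = "old_2996_vm";   /* deliberately never created */
    const char *new_vm = "new_2996_vm";   /* vm built during COMMENT ON */
    FILE       *f;

    /* Simulate the vm fork the ephemeral relation accumulated. */
    if ((f = fopen(new_vm, "w")) != NULL)
        fclose(f);

    transfer_fork(old_vm, new_vm, "_vm", 0);

    /* The stale vm fork survives, describing data it has nothing to do with. */
    printf("stale %s still exists: %s\n", new_vm,
           access(new_vm, F_OK) == 0 ? "yes" : "no");

    unlink(new_vm);    /* clean up the demo file */
    return 0;
}

Run it and it reports that new_2996_vm is still there after the "transfer".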
There are other scary possibilities. Imagine that the ephemeral
pg_largeobject_metadata ends up bigger than on the old server, e.g. because we
are a bit more aggressive about bulk relation extension in the new version
(which we should be!). If that size difference ends up with a different
segment count between the old server and the ephemeral relation in the new
server, we're in trouble: transfer_relfile() wouldn't unlink the additional
segments on the target system.
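
To illustrate the segment case in the same spirit, here is another
hypothetical sketch - again not the real transfer_relfile(); names, the
fixed-size buffers and the plain printf are illustrative only. The loop walks
the *old* cluster's segments and stops at the first missing one, so a
higher-numbered segment that exists only on the new side is never visited,
let alone unlinked:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical per-segment transfer loop.  It iterates over the old
 * cluster's segments (file, file.1, file.2, ...) and returns as soon as
 * one is missing.  An extra segment that only the new cluster has (say
 * the ephemeral relation grew a ".2") is never unlinked and survives.
 */
static void
transfer_segments(const char *old_base, const char *new_base)
{
    for (int segno = 0;; segno++)
    {
        char        old_file[1024];
        char        new_file[1024];
        struct stat statbuf;

        if (segno == 0)
        {
            snprintf(old_file, sizeof(old_file), "%s", old_base);
            snprintf(new_file, sizeof(new_file), "%s", new_base);
        }
        else
        {
            snprintf(old_file, sizeof(old_file), "%s.%d", old_base, segno);
            snprintf(new_file, sizeof(new_file), "%s.%d", new_base, segno);
        }

        /* Stop once the old cluster runs out of segments ... */
        if (segno != 0 && stat(old_file, &statbuf) != 0 && errno == ENOENT)
            return;    /* ... leaving extra new-side segments untouched */

        /* Replace the new cluster's segment with the old cluster's data. */
        unlink(new_file);
        printf("would copy/link %s -> %s\n", old_file, new_file);
    }
}

int
main(void)
{
    /* Paths are made up for illustration only. */
    transfer_segments("old/16384/2996", "new/16384/2996");
    return 0;
}

Against a layout where new/16384/2996.2 exists but old/16384/2996.2 does not,
that ".2" file would still be sitting there after the loop returns.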
Greetings,
Andres Freund