Re: making relfilenodes 56 bits

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-08-11 05:28:42
Message-ID: CAFiTN-vTw=XSU629euHirRezj_TtqDLgasL9Eak65RKgoStxVg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 9, 2022 at 8:51 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I think even if we start the range from the 4 billion we can not avoid
> > keeping two separate ranges for system and user tables otherwise the
> > next upgrade where old and new clusters both have 56 bits
> > relfilenumber will get conflicting files. And, for the same reason we
> > still have to call SetNextRelFileNumber() during upgrade.
>
> Well, my proposal to move everything from the new cluster up to higher
> numbers would address this without requiring two ranges.
>
> > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > range will start from 4 billion and user range maybe something around
> > 4.1 (I think we can keep it very small though, just reserve 50k
> > relfilenumber for system for future expansion and start user range
> > from there).
>
> A disadvantage of this is that it basically means all the file names
> in new clusters are going to be 10 characters long. That's not a big
> disadvantage, but it's not wonderful. File names that are only 5-7
> characters long are common today, and easier to remember.

That's correct.
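
For reference, a quick check of the name lengths involved. This is only a sketch: it assumes the base file name is just the decimal relfilenumber (ignoring fork suffixes and segment numbers), and it uses exactly 4,000,000,000 as a stand-in for "4 billion". The names name_length and SYSTEM_RANGE_START are made up for illustration.

```python
# Sketch only: assumes the base file name is the decimal relfilenumber.
def name_length(relfilenumber: int) -> int:
    """Number of characters in the decimal form of a relfilenumber."""
    return len(str(relfilenumber))

SYSTEM_RANGE_START = 4_000_000_000   # hypothetical stand-in for "4 billion"
MAX_56_BIT = 2**56 - 1               # largest value a 56-bit number can hold

lengths = {
    "typical value today (e.g. 16384)": name_length(16384),
    "start of the proposed range": name_length(SYSTEM_RANGE_START),
    "largest 56-bit value": name_length(MAX_56_BIT),
}

for label, length in lengths.items():
    print(f"{label}: {length} characters")
```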

> > So now system tables have no issues and also the user tables from the
> > old cluster have no issues. But pg_largeobject might get conflict
> > when both old and new cluster are using 56 bits relfilenumber, because
> > it is possible that in the new cluster some other system table gets
> > that relfilenumber which is used by pg_largeobject in the old cluster.
> >
> > This could be resolved if we allocate pg_largeobject's relfilenumber
> > from the user range, that means this relfilenumber will always be the
> > first value from the user range. So now if the old and new cluster
> > both are using 56bits relfilenumber then pg_largeobject in both
> > cluster would have got the same relfilenumber and if the old cluster
> > is using the current 32 bits relfilenode system then the whole range
> > of the new cluster is completely different than that of the old
> > cluster.
>
> I think this can work, but it does rely to some extent on the fact
> that there are no other tables which need to be treated like
> pg_largeobject. If there were others, they'd need fixed starting
> RelFileNumber assignments, or some other trick, like renumbering them
> twice in the cluster, first to a known-unused value and then back to
> the proper value. You'd have trouble if in the other cluster
> pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> cluster the reverse, without some hackery.

Agreed. If there were more catalogs like pg_largeobject, it would
require some hackery.
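
To make the swapped-numbers case concrete, here is a rough sketch of the "renumber twice" trick. It is only a model: a dict stands in for the catalog, pg_largeobject2 is the hypothetical second catalog from your example, and renumber(), BASE and UNUSED are names I made up for illustration.

```python
# Sketch only: a dict stands in for the name -> relfilenumber mapping.
def renumber(catalog, name, new_number):
    """Assign new_number to name, refusing if another relation holds it."""
    for other, number in catalog.items():
        if other != name and number == new_number:
            raise ValueError(f"{new_number} already used by {other}")
    catalog[name] = new_number

BASE = 4_000_000_000        # hypothetical start of the range ("4bn")
UNUSED = BASE + 1_000_000   # hypothetical known-unused value

# The new cluster has the two numbers reversed relative to the old one.
old_cluster = {"pg_largeobject": BASE + 1, "pg_largeobject2": BASE + 2}
new_cluster = {"pg_largeobject": BASE + 2, "pg_largeobject2": BASE + 1}

def direct_swap_fails():
    """Moving pg_largeobject straight to its old number collides."""
    trial = dict(new_cluster)
    try:
        renumber(trial, "pg_largeobject", old_cluster["pg_largeobject"])
    except ValueError:
        return True
    return False

def two_step_swap():
    """Park one relation on an unused value, then move both into place."""
    result = dict(new_cluster)
    renumber(result, "pg_largeobject", UNUSED)
    renumber(result, "pg_largeobject2", old_cluster["pg_largeobject2"])
    renumber(result, "pg_largeobject", old_cluster["pg_largeobject"])
    return result
```

With a single catalog like pg_largeobject, none of this is needed, which is why the fixed first-value-of-the-user-range assignment is enough today.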

> I do feel like your idea here has some advantages - my proposal
> requires rewriting all the catalogs in the new cluster before we do
> anything else, and that's going to take some time even though they
> should be small. But I also feel like it has some disadvantages: it
> seems to rely on complicated reasoning and special cases more than I'd
> like.

One other advantage of your approach is that the new cluster's
"nextrelfilenumber" starts after the old cluster's entire relfilenumber
range. So we only need to set "nextrelfilenumber" once, at the
beginning; after that, while upgrading each object, we don't need to
set it again because it is already higher than anything in the old
cluster. In the other two approaches we will have to try to set
nextrelfilenumber every time we preserve a relfilenumber during
upgrade.
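
A toy model of that difference, as a sketch only. RelFileNumberCounter and set_next() are hypothetical stand-ins loosely named after the SetNextRelFileNumber() call mentioned earlier in the thread; they do not show its real behavior. The old-cluster values are made up.

```python
# Sketch only: models the counter, not PostgreSQL's actual implementation.
class RelFileNumberCounter:
    def __init__(self, start):
        self.next = start
        self.set_calls = 0   # how many times the counter had to be touched

    def set_next(self, value):
        """Advance the counter so 'value' can never be handed out later."""
        self.set_calls += 1
        if value >= self.next:
            self.next = value + 1

old_relfilenumbers = [16384, 16390, 16401]   # made-up old-cluster values

# Start above the old cluster's whole range: one call, at the beginning.
above = RelFileNumberCounter(start=0)
above.set_next(max(old_relfilenumbers))
preserved = list(old_relfilenumbers)         # no further counter updates

# Otherwise: try to bump the counter for every preserved relfilenumber.
per_object = RelFileNumberCounter(start=0)
for n in old_relfilenumbers:
    per_object.set_next(n)
```

Both variants end with the same counter value; the difference is only in how often the counter has to be adjusted during the upgrade.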

Apart from these two approaches there is a third one (what the patch
set is already doing), where we keep the system relfilenumber range the
same as the Oid. I know you have already pointed out that this might
have some hidden bugs, but one advantage of this approach is that it is
simpler than the other two: it doesn't need to maintain two ranges,
and it doesn't need to rewrite all the system tables in the new
cluster. So I think it would be good to get others' opinions on all
three approaches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
