Re: making relfilenodes 56 bits

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-08-11 08:15:19
Message-ID: CAFiTN-v7Jb_v+ACbN41HfYGxZeLihV7=4mcvwHgFysg86VqVhQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 11, 2022 at 10:58 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Aug 9, 2022 at 8:51 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > I think even if we start the range from the 4 billion we can not avoid
> > > keeping two separate ranges for system and user tables otherwise the
> > > next upgrade where old and new clusters both have 56 bits
> > > relfilenumber will get conflicting files. And, for the same reason we
> > > still have to call SetNextRelFileNumber() during upgrade.
> >
> > Well, my proposal to move everything from the new cluster up to higher
> > numbers would address this without requiring two ranges.
> >
> > > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > > range will start from 4 billion and user range maybe something around
> > > 4.1 (I think we can keep it very small though, just reserve 50k
> > > relfilenumber for system for future expansion and start user range
> > > from there).
> >
> > A disadvantage of this is that it basically means all the file names
> > in new clusters are going to be 10 characters long. That's not a big
> > disadvantage, but it's not wonderful. File names that are only 5-7
> > characters long are common today, and easier to remember.
>
> That's correct.
>
> > > So now system tables have no issues and also the user tables from the
> > > old cluster have no issues. But pg_largeobject might get conflict
> > > when both old and new cluster are using 56 bits relfilenumber, because
> > > it is possible that in the new cluster some other system table gets
> > > that relfilenumber which is used by pg_largeobject in the old cluster.
> > >
> > > This could be resolved if we allocate pg_largeobject's relfilenumber
> > > from the user range, that means this relfilenumber will always be the
> > > first value from the user range. So now if the old and new cluster
> > > both are using 56bits relfilenumber then pg_largeobject in both
> > > cluster would have got the same relfilenumber and if the old cluster
> > > is using the current 32 bits relfilenode system then the whole range
> > > of the new cluster is completely different than that of the old
> > > cluster.
> >
> > I think this can work, but it does rely to some extent on the fact
> > that there are no other tables which need to be treated like
> > pg_largeobject. If there were others, they'd need fixed starting
> > RelFileNumber assignments, or some other trick, like renumbering them
> > twice in the cluster, first to a known-unused value and then back to
> > the proper value. You'd have trouble if in the other cluster
> > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> > cluster the reverse, without some hackery.
>
> Agreed, if there were more catalogs like pg_largeobject then it would
> require some hacking.
>
> > I do feel like your idea here has some advantages - my proposal
> > requires rewriting all the catalogs in the new cluster before we do
> > anything else, and that's going to take some time even though they
> > should be small. But I also feel like it has some disadvantages: it
> > seems to rely on complicated reasoning and special cases more than I'd
> > like.
>
> One other advantage of your approach is that, since we are starting
> the "nextrelfilenumber" after the old cluster's relfilenumber range,
> we only need to set the "nextrelfilenumber" once at the beginning.
> After that, while upgrading each object, we don't need to set the
> nextrelfilenumber every time because it is already higher than the
> complete old cluster range. In the other 2 approaches we will have to
> try to set the nextrelfilenumber every time we preserve the
> relfilenumber during upgrade.

I was also wondering whether we would get the max "relfilenumber"
from the old cluster at the cluster level or per database. If we
want it per database, we can run a simple query on pg_class, but
then we also need to work out how to handle mapped relations that
have been rewritten. I don't think we can get the max relfilenumber
from the old cluster at the cluster level. Maybe in newer versions we
could expose a function from the server that just returns the
NextRelFileNumber, and that would be the max relfilenumber, but I'm
not sure how to do that in the old version.
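
To make the per-database option concrete, here is a rough sketch (not
part of any patch) of how the per-database maxima could be combined
into one cluster-wide starting point. The query string and both helper
names are illustrative assumptions. The query uses
pg_relation_filenode() rather than pg_class.relfilenode because mapped
relations store 0 in pg_class; whether that fully covers rewritten
mapped relations on every old version is exactly the open question
above.

```python
from typing import Mapping, Optional

# Hypothetical query, to be run once in each database of the old cluster.
# pg_relation_filenode() resolves mapped relations, for which
# pg_class.relfilenode is stored as 0; it returns NULL for relations
# without storage, and max() ignores those NULLs.
PER_DATABASE_QUERY = (
    "SELECT max(pg_relation_filenode(oid::regclass)) FROM pg_class"
)


def cluster_max_relfilenumber(
    per_database_max: Mapping[str, Optional[int]],
) -> int:
    """Return the largest relfilenumber reported by any database.

    A value of None stands for a NULL query result. Returns 0 when no
    database reported a value.
    """
    known = (m for m in per_database_max.values() if m is not None)
    return max(known, default=0)


def initial_next_relfilenumber(
    per_database_max: Mapping[str, Optional[int]],
) -> int:
    """Return a starting value strictly above every reported relfilenumber."""
    return cluster_max_relfilenumber(per_database_max) + 1
```

With something like this, the new cluster's counter would only need to
be set once, to initial_next_relfilenumber(...), before any objects are
restored.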

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
