Re: making relfilenodes 56 bits

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-07 11:44:48
Message-ID: CAFiTN-s5ehHu2N9vuZcwg+NOENvf3v0FMXQQZq1faGjReP7mtw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 7, 2022 at 2:54 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

Thanks for committing the 0001.

> On Wed, Jul 6, 2022 at 11:57 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > I think 0002 and 0003 need more work yet; I'll try to write a review
> > of those soon.
>
> Regarding 0002:
>
> I don't particularly like the names BufTagCopyRelFileLocator and
> BufTagRelFileLocatorEquals. My suggestion is to rename
> BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it
> doesn't really make sense to me to talk about equality between values
> of different data types. Instead of BufTagCopyRelFileLocator I would
> prefer BufTagGetRelFileLocator. That would make it more similar to
> BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a
> good thing.
>
> Other than that I think 0002 seems fine.

Changed as suggested. Although I feel BufTagCopyRelFileLocator is
actually copying the relfilelocator from buffer tag to an input
variable, I am fine with BufTagGetRelFileLocator so that it is similar
to the other names.

Changed some other macro names as below because field name they are
getting/setting is relNumber
BufTagSetFileNumber -> BufTagSetRelNumber
BufTagGetFileNumber -> BufTagGetRelNumber

> Regarding 0003:

> I'm worried that this might not be correct. The comment changes here
> (and I think also in some other plces) imply that we've eliminated
> relfilenode ruse, but I think that's not true. createdb() and movedb()
> don't seem to be modified, so I think it's possible to just copy a
> template database over without change, which means that relfilenumbers
> and even relfilelocators could be reused. So I feel like maybe this
> and similar places shouldn't be modified in this way. Am I
> misunderstanding?

I think you are right, so I changed it.

> /*
> - * Relfilenumbers are not unique in databases across
> tablespaces, so we need
> - * to allocate a new one in the new tablespace.
> + * Generate a new relfilenumber. Although relfilenumber are
> unique within a
> + * cluster, we are unable to use the old relfilenumber since unused
> + * relfilenumber are not unlinked until commit. So if within a
> + * transaction, if we set the old tablespace again, we will
> get conflicting
> + * relfilenumber file.
> */
> - newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
> -
> rel->rd_rel->relpersistence);
> + newrelfilenumber = GetNewRelFileNumber();
>
> I can't clearly understand this comment. Is it saying that the code
> which follows is broken and needs to be fixed by a future patch before
> things are OK again? If so, that's not good.

No it is not broken in this patch. Basically, before our patch the
reason for allocating the new relfilenumber was that if we create the
file with oldrelfilenumber in new tablespace then it is possible that
in the new tablespace file with same name exist because relfilenumber
was unique in databases across tablespaces so there could be conflict.
But now that is not the case but still we can not reuse the old
relfilenumber because from the old tablespace the old relfilenumber
file is not removed until the next checkpoint so if we move the table
back to the old tablespace again then there could be conflict. And
even after we get the final patch of removing the tombstone file on
commit then also we can not reuse the old relfilenumber because within
a transaction we can switch between the tablespaces multiple times and
the relfilenumber file from the old tablespace will be removed only on
commit. This is what I am trying to explain in the comment.

Now I have modified the comment slightly, such that in 0002 I am
saying files are not removed until the next checkpoint and in 0004 I
am modifying that and saying not removed until commit.

> - * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
> - * catalog/catalog.c.
> + * callers should be GetNewOidWithIndex() in catalog/catalog.c.
>
> If there is only one, it should say "caller", not "callers".
>
> Orphan files are harmless --- at worst they waste a bit of disk space ---
> -because we check for on-disk collisions when allocating new relfilenumber
> -OIDs. So cleaning up isn't really necessary.
> +because relfilenumber is 56 bit wide so logically there should not be any
> +collisions. So cleaning up isn't really necessary.
>
> I don't agree that orphaned files are harmless, but changing that is
> beyond the scope of this patch. I think that the way you've ended the
> sentence isn't sufficiently clear and correct even if we accept the
> principle that orphaned files are harmless. What I think we should
> stay instead is "because the relfilenode counter is monotonically
> increasing. The maximum value is 2^56-1, and there is no provision for
> wraparound."

Done

> + /*
> + * Check if we set the new relfilenumber then do we run out of
> the logged
> + * relnumber, if so then we need to WAL log again. Otherwise,
> just adjust
> + * the relnumbercount.
> + */
> + relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
> + if (ShmemVariableCache->relnumbercount <= relnumbercount)
> + {
> + LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH);
> + ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
> + }
> + else
> + ShmemVariableCache->relnumbercount -= relnumbercount;
>
> Would it be clearer, here and elsewhere, if VariableCacheData tracked
> nextRelFileNumber and nextUnloggedRelFileNumber instead of
> nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea
> seems worth considering.

I think it is in line with oidCount, what do you think?

>
> + * Flush xlog record to disk before returning. To protect against file
> + * system changes reaching the disk before the
> XLOG_NEXT_RELFILENUMBER log.
>
> The way this is worded, you would need it to be just one sentence,
> like "Flush xlog record to disk before returning to protect
> against...". Or else add "this is," like "This is to protect
> against..."
>
> But I'm thinking maybe we could reword it a little more, perhaps
> something like this: "Flush xlog record to disk before returning. We
> want to be sure that the in-memory nextRelFileNumber value is always
> larger than any relfilenumber that is already in use on disk. To
> maintain that invariant, we must make sure that the record we just
> logged reaches the disk before any new files are created."

Done

> This isn't a full review, I think, but I'm kind of out of time and
> energy for today.

I have updated some other comments as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v7-0001-Preliminary-refactoring-for-supporting-larger.patch text/x-patch 20.0 KB
v7-0003-Assert-checking-to-be-merged-with-0002.patch text/x-patch 4.7 KB
v7-0002-Widen-relfilenumber-from-32-bits-to-56-bits.patch text/x-patch 78.1 KB
v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch text/x-patch 14.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2022-07-07 11:53:29 Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)
Previous Message Peter Eisentraut 2022-07-07 11:44:38 Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)