Re: making relfilenodes 56 bits

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-06 21:24:29
Message-ID: CA+TgmoZTJsbJXoXWceQkaHdxjFz23BL=+xZiD64FgqVAQ77jPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 6, 2022 at 11:57 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think 0002 and 0003 need more work yet; I'll try to write a review
> of those soon.

Regarding 0002:

I don't particularly like the names BufTagCopyRelFileLocator and
BufTagRelFileLocatorEquals. My suggestion is to rename
BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it
doesn't really make sense to me to talk about equality between values
of different data types. Instead of BufTagCopyRelFileLocator I would
prefer BufTagGetRelFileLocator. That would make it more similar to
BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a
good thing.

Other than that I think 0002 seems fine.

Regarding 0003:

/*
* Don't try to prefetch
anything in this database until
- * it has been created, or we
might confuse the blocks of
- * different generations, if a
database OID or
- * relfilenumber is reused.
It's also more efficient than
+ * it has been created,
because it's more efficient than
* discovering that relations
don't exist on disk yet with
* ENOENT errors.
*/

I'm worried that this might not be correct. The comment changes here
(and I think also in some other plces) imply that we've eliminated
relfilenode ruse, but I think that's not true. createdb() and movedb()
don't seem to be modified, so I think it's possible to just copy a
template database over without change, which means that relfilenumbers
and even relfilelocators could be reused. So I feel like maybe this
and similar places shouldn't be modified in this way. Am I
misunderstanding?

/*
- * Relfilenumbers are not unique in databases across
tablespaces, so we need
- * to allocate a new one in the new tablespace.
+ * Generate a new relfilenumber. Although relfilenumber are
unique within a
+ * cluster, we are unable to use the old relfilenumber since unused
+ * relfilenumber are not unlinked until commit. So if within a
+ * transaction, if we set the old tablespace again, we will
get conflicting
+ * relfilenumber file.
*/
- newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
-
rel->rd_rel->relpersistence);
+ newrelfilenumber = GetNewRelFileNumber();

I can't clearly understand this comment. Is it saying that the code
which follows is broken and needs to be fixed by a future patch before
things are OK again? If so, that's not good.

- * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
- * catalog/catalog.c.
+ * callers should be GetNewOidWithIndex() in catalog/catalog.c.

If there is only one, it should say "caller", not "callers".

Orphan files are harmless --- at worst they waste a bit of disk space ---
-because we check for on-disk collisions when allocating new relfilenumber
-OIDs. So cleaning up isn't really necessary.
+because relfilenumber is 56 bit wide so logically there should not be any
+collisions. So cleaning up isn't really necessary.

I don't agree that orphaned files are harmless, but changing that is
beyond the scope of this patch. I think that the way you've ended the
sentence isn't sufficiently clear and correct even if we accept the
principle that orphaned files are harmless. What I think we should
stay instead is "because the relfilenode counter is monotonically
increasing. The maximum value is 2^56-1, and there is no provision for
wraparound."

+ /*
+ * Check if we set the new relfilenumber then do we run out of
the logged
+ * relnumber, if so then we need to WAL log again. Otherwise,
just adjust
+ * the relnumbercount.
+ */
+ relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+ if (ShmemVariableCache->relnumbercount <= relnumbercount)
+ {
+ LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH);
+ ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+ }
+ else
+ ShmemVariableCache->relnumbercount -= relnumbercount;

Would it be clearer, here and elsewhere, if VariableCacheData tracked
nextRelFileNumber and nextUnloggedRelFileNumber instead of
nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea
seems worth considering.

+ * Flush xlog record to disk before returning. To protect against file
+ * system changes reaching the disk before the
XLOG_NEXT_RELFILENUMBER log.

The way this is worded, you would need it to be just one sentence,
like "Flush xlog record to disk before returning to protect
against...". Or else add "this is," like "This is to protect
against..."

But I'm thinking maybe we could reword it a little more, perhaps
something like this: "Flush xlog record to disk before returning. We
want to be sure that the in-memory nextRelFileNumber value is always
larger than any relfilenumber that is already in use on disk. To
maintain that invariant, we must make sure that the record we just
logged reaches the disk before any new files are created."

This isn't a full review, I think, but I'm kind of out of time and
energy for today.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-07-06 21:46:27 Re: automatically generating node support functions
Previous Message Andrew Dunstan 2022-07-06 21:13:27 Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~