Re: making relfilenodes 56 bits

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-09-20 17:14:46
Message-ID: CA+TgmobFti5JuRZKq7hF8JJJgmyr1vpC-cOsu3dVc1dka9wApA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> [ new patch ]

+typedef pg_int64 RelFileNumber;

This seems really random to me. First, why isn't this an unsigned
type? OID is unsigned and I don't see a reason to change to a signed
type. But even if we were going to change to a signed type, why
pg_int64? That is declared like this:

/* Define a signed 64-bit integer type for use in client API declarations. */
typedef PG_INT64_TYPE pg_int64;

Surely this is not a client API declaration....

Note that if we change this a lot of references to INT64_FORMAT will
need to become UINT64_FORMAT.

I think we should use int64 at the SQL level, because we don't have an
unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
similar. But internally I think using unsigned is cleaner.

+ * RelFileNumber is unique within a cluster.

Not really, because of CREATE DATABASE. Probably just drop this line.
Or else expand it: we never assign the same RelFileNumber twice within
the lifetime of the same cluster, but there can be multiple relations
with the same RelFileNumber e.g. because CREATE DATABASE duplicates
the RelFileNumber values from the template database. But maybe we
don't need this here, as it's already explained in relfilelocator.h.

+ ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS);

Why not declare ret as ForkNumber instead of casting twice?

+ uint64 relnum;
+
+ Assert(relnumber <= MAX_RELFILENUMBER);
+ Assert(forknum <= MAX_FORKNUM);
+
+ relnum = relnumber;

Perhaps it'd be better to write uint64 relnum = relnumber instead of
initializing on a separate line.

+#define RELNUMBERCHARS 20 /* max chars printed by %llu */

Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if
there's some reason to stick with a signed type).

+ elog(ERROR, "relfilenumber is out of bound");

It would have to be "out of bounds", with an "s". But maybe "is too
large" would be better.

+ nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+ loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+ flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber;

Maybe it would be a good idea to asset that next <= flushed and
flushed <= logged?

+#ifdef USE_ASSERT_CHECKING
+
+ {
+ RelFileLocatorBackend rlocator;
+ char *rpath;

Let's add a comment here, like "Because the RelFileNumber counter only
ever increases and never wraps around, it should be impossible for the
newly-allocated RelFileNumber to already be in use. But, if Asserts
are enabled, double check that there's no main-fork relation file with
the new RelFileNumber already on disk."

+ elog(ERROR, "cannot forward RelFileNumber during recovery");

forward -> set (or advance)

+ if (relnumber >= ShmemVariableCache->loggedRelFileNumber)

It probably doesn't make any difference, but to me it seems better to
test flushedRelFileNumber rather than logRelFileNumber here. What do
you think?

/*
* We set up the lockRelId in case anything tries to lock the dummy
- * relation. Note that this is fairly bogus since relNumber may be
- * different from the relation's OID. It shouldn't really matter though.
- * In recovery, we are running by ourselves and can't have any lock
- * conflicts. While syncing, we already hold AccessExclusiveLock.
+ * relation. Note we are setting relId to just FirstNormalObjectId which
+ * is completely bogus. It shouldn't really matter though. In recovery,
+ * we are running by ourselves and can't have any lock conflicts. While
+ * syncing, we already hold AccessExclusiveLock.
*/
rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
- rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
+ rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;

Boy, this makes me uncomfortable. The existing logic is pretty bogus,
and we're replacing it with some other bogus thing. Do we know whether
anything actually does try to use this for locking?

One notable difference between the existing logic and your change is
that, with the existing logic, we use a bogus value that will differ
from one relation to the next, whereas with this change, it will
always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
(Oid) rlocator.relNumber would be a more natural adaptation?

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \
+do { \
+ if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+ ereport(ERROR, \
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+ errmsg("relfilenumber %lld is out of range", \
+ (long long) (relfilenumber))); \
+} while (0)

Here, you take the approach of casting the relfilenumber to long long
and then using %lld. But elsewhere, you use
INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
ought to use it everywhere.

typedef struct
{
- Oid reltablespace;
- RelFileNumber relfilenumber;
-} RelfilenumberMapKey;
-
-typedef struct
-{
- RelfilenumberMapKey key; /* lookup key - must be first */
+ RelFileNumber relfilenumber; /* lookup key - must be first */
Oid relid; /* pg_class.oid */
} RelfilenumberMapEntry;

This feels like a bold change. Are you sure it's safe? i.e. Are you
certain that there's no way that a relfilenumber could repeat within a
database? If we're going to bank on that, we could adapt this more
heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
parameter. I think maybe we should push this change into an 0002 patch
(or later) and have 0001 just do a minimal adaptation for the changed
data type.

Datum
pg_control_checkpoint(PG_FUNCTION_ARGS)
{
- Datum values[18];
- bool nulls[18];
+ Datum values[19];
+ bool nulls[19];

Documentation updated is needed.

-Note that while a table's filenode often matches its OID, this is
-<emphasis>not</emphasis> necessarily the case; some operations, like
+Note that table's filenode are completely different than its OID. Although for
+system catalogs initial filenode matches with its OID, but some
operations, like
<command>TRUNCATE</command>, <command>REINDEX</command>,
<command>CLUSTER</command> and some forms
of <command>ALTER TABLE</command>, can change the filenode while
preserving the OID.
-Avoid assuming that filenode and table OID are the same.

Suggest: Note that a table's filenode will normally be different than
the OID. For system tables, the initial filenode will be equal to the
table OID, but it will be different if the table has ever been
subjected to a rewriting operation, such as TRUNCATE, REINDEX,
CLUSTER, or some forms of ALTER TABLE. For user tables, even the
initial filenode will be different than the table OID.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2022-09-20 17:15:14 Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
Previous Message Robert Haas 2022-09-20 16:49:46 Re: cataloguing NOT NULL constraints