Re: making relfilenodes 56 bits

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-27 07:54:07
Message-ID: CAE9k0PnGs2CMTnw3otfpxi54eVLvQVOf2uAx+DWH_uP0hLuTJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Some more comments:

==

Shouldn't we retry for the new relfilenumber if
"ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER". There can be a
cases where some of the tables are dropped by the user and relfilenumber of
those tables can be reused for which we would need to find the
relfilenumber that can be resued. For e.g. consider below example:

postgres=# create table t1(a int);
CREATE TABLE

postgres=# create table t2(a int);
CREATE TABLE

postgres=# create table t3(a int);
ERROR: relfilenumber is out of bound

postgres=# drop table t1, t2;
DROP TABLE

postgres=# checkpoint;
CHECKPOINT

postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it
doesn't because we simply error out by seeing the nextRelFileNumber saved
in the shared memory.

postgres=# create table t1(a int);
ERROR: relfilenumber is out of bound

I think, above should have worked.

==

<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</emphasis> necessarily the case; some operations, like
<command>TRUNCATE</command>, <command>REINDEX</command>,
<command>CLUSTER</command> and some forms
of <command>ALTER TABLE</command>, can change the filenode while preserving
the OID.

I think this note needs some improvement in storage.sgml. It says the
table's relfilenode mostly matches its OID, but it doesn't. This will
happen only in case of system table and maybe never in case of user table.

==

postgres=# create table t2(a int);
ERROR: relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention
relfilenode instead of relfilenumber. Elsewhere (including the user manual)
we refer to this as a relfilenode.

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 10:36 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
wrote:

> Thanks Dilip. Here are few comments that could find upon quickly
> reviewing the v11 patch:
>
> /*
> + * Similar to the XLogPutNextOid but instead of writing NEXTOID log
> record it
> + * writes a NEXT_RELFILENUMBER log record. If '*prevrecptr' is a valid
> + * XLogRecPtrthen flush the wal upto this record pointer otherwise flush
> upto
>
> XLogRecPtrthen -> XLogRecPtr then
>
> ==
>
> + switch (relpersistence)
> + {
> + case RELPERSISTENCE_TEMP:
> + backend = BackendIdForTempRelations();
> + break;
> + case RELPERSISTENCE_UNLOGGED:
> + case RELPERSISTENCE_PERMANENT:
> + backend = InvalidBackendId;
> + break;
> + default:
> + elog(ERROR, "invalid relpersistence: %c", relpersistence);
> + return InvalidRelFileNumber; /* placate compiler */
> + }
>
>
> I think the above check should be added at the beginning of the function
> for the reason that if we come to the default switch case we won't be
> acquiring the lwlock and do other stuff to get a new relfilenumber.
>
> ==
>
> - newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
> + * Generate a new relfilenumber. We cannot reuse the old relfilenumber
> + * because of the possibility that that relation will be moved back to
> the
>
> that that relation -> that relation.
>
> ==
>
> + * option_parse_relfilenumber
> + *
> + * Parse relfilenumber value for an option. If the parsing is successful,
> + * returns; if parsing fails, returns false.
> + */
>
> If parsing is successful, returns true;
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
>> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
>> wrote:
>>
>> Hi,
>> Note: please avoid top posting.
>>
>> > /*
>> > * If relfilenumber is unspecified by the caller then create
>> storage
>> > - * with oid same as relid.
>> > + * with relfilenumber same as relid if it is a system table
>> otherwise
>> > + * allocate a new relfilenumber. For more details read
>> comments atop
>> > + * FirstNormalRelFileNumber declaration.
>> > */
>> > if (!RelFileNumberIsValid(relfilenumber))
>> > - relfilenumber = relid;
>> > + {
>> > + relfilenumber = relid < FirstNormalObjectId ?
>> > + relid : GetNewRelFileNumber();
>> >
>> > Above code says that in the case of system table we want relfilenode to
>> be the same as object id. This technically means that the relfilenode or
>> oid for the system tables would not be exceeding 16383. However in the
>> below lines of code added in the patch, it says there is some chance for
>> the storage path of the user tables from the old cluster conflicting with
>> the storage path of the system tables in the new cluster. Assuming that the
>> OIDs for the user tables on the old cluster would start with 16384 (the
>> first object ID), I see no reason why there would be a conflict.
>>
>>
>> Basically, the above comment says that the initial system table
>> storage will be created with the same relfilenumber as Oid so you are
>> right that will not exceed 16383. And below code is explaining the
>> reason that in order to avoid the conflict with the user table from
>> the older cluster we do it this way. Otherwise, in the new design, we
>> have no intention to keep the relfilenode same as Oid. But during an
>> upgrade from the older cluster which is not following this new design
>> might have user table relfilenode which can conflict with the system
>> table in the new cluster so we have to ensure that with the new design
>> also when creating the initial cluster we keep the system table
>> relfilenode in low range and directly using Oid is the best idea for
>> this purpose instead of defining the completely new range and
>> maintaining a separate counter for that.
>>
>> > +/* ----------
>> > + * RelFileNumber zero is InvalidRelFileNumber.
>> > + *
>> > + * For the system tables (OID < FirstNormalObjectId) the initial
>> storage
>> > + * will be created with the relfilenumber same as their oid. And,
>> later for
>> > + * any storage the relfilenumber allocated by GetNewRelFileNumber()
>> will start
>> > + * at 100000. Thus, when upgrading from an older cluster, the
>> relation storage
>> > + * path for the user table from the old cluster will not conflict with
>> the
>> > + * relation storage path for the system table from the new cluster.
>> Anyway,
>> > + * the new cluster must not have any user tables while upgrading, so
>> we needn't
>> > + * worry about them.
>> > + * ----------
>> > + */
>> > +#define FirstNormalRelFileNumber ((RelFileNumber) 100000)
>> >
>> > ==
>> >
>> > When WAL logging the next object id we have the chosen the xlog
>> threshold value as 8192 whereas for relfilenode it is 512. Any reason for
>> choosing this low arbitrary value in case of relfilenumber?
>>
>> For Oid when we cross the max value we will wraparound, whereas for
>> relfilenumber we can not expect the wraparound for cluster lifetime.
>> So it is better not to log forward a really large number of
>> relfilenumber as we do for Oid. OTOH if we make it really low like 64
>> then we can is RelFIleNumberGenLock in wait event in very high
>> concurrency where from 32 backends we are continuously
>> creating/dropping tables. So we thought of choosing this number 512
>> so that it is not very low that can create the lock contention and it
>> is not very high so that we need to worry about wasting those many
>> relfilenumbers on the crash.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message houzj.fnst@fujitsu.com 2022-07-27 07:57:39 RE: Perform streaming logical transactions by background workers and parallel apply
Previous Message Simon Riggs 2022-07-27 07:36:13 Re: [PATCH] Compression dictionaries for JSONB