Re: making relfilenodes 56 bits

From: Andres Freund <andres(at)anarazel(dot)de>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-02 04:08:40
Message-ID: 20220702040840.6hkhdr6nv3u2pqc3@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I'm not feeling inspired by "locator", tbh. But I don't really have a great
alternative, so ...

On 2022-07-01 16:12:01 +0530, Dilip Kumar wrote:
> From f07ca9ef19e64922c6ee410707e93773d1a01d7c Mon Sep 17 00:00:00 2001
> From: dilip kumar <dilipbalaut(at)localhost(dot)localdomain>
> Date: Sat, 25 Jun 2022 10:43:12 +0530
> Subject: [PATCH v4 2/4] Preliminary refactoring for supporting larger
> relfilenumber

I don't think we have abbreviated buffer as 'buff' in many places? I think we
should either spell buffer out or use 'buf'. This is in regard to BuffTag etc.

> Subject: [PATCH v4 3/4] Use 56 bits for relfilenumber to avoid wraparound

> /*
> + * GenerateNewRelFileNumber
> + *
> + * Similar to GetNewObjectId but instead of new Oid it generates new
> + * relfilenumber.
> + */
> +RelFileNumber
> +GetNewRelFileNumber(void)
> +{
> + RelFileNumber result;
> +
> + /* Safety check, we should never get this far in a HS standby */

Normally we don't capitalize the first character of a comment that's not a
full sentence (i.e. ending with a punctuation mark).

> + if (RecoveryInProgress())
> + elog(ERROR, "cannot assign RelFileNumber during recovery");
> +
> + LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);
> +
> + /* Check for the wraparound for the relfilenumber counter */
> + if (unlikely (ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER))
> + elog(ERROR, "relfilenumber is out of bound");
> +
> + /* If we run out of logged for use RelFileNumber then we must log more */

"logged for use" - looks like you reformulated the sentence incompletely.

> + if (ShmemVariableCache->relnumbercount == 0)
> + {
> + XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> + VAR_RFN_PREFETCH);

I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.

> diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
> index e1f4eef..1cf039c 100644
> --- a/src/include/catalog/pg_class.h
> +++ b/src/include/catalog/pg_class.h
> @@ -31,6 +31,10 @@
> */
> CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,RelationRelation_Rowtype_Id) BKI_SCHEMA_MACRO
> {
> + /* identifier of physical storage file */
> + /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> + int64 relfilenode BKI_DEFAULT(0);
> +
> /* oid */
> Oid oid;
>
> @@ -52,10 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
> /* access method; 0 if not a table / index */
> Oid relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
>
> - /* identifier of physical storage file */
> - /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> - Oid relfilenode BKI_DEFAULT(0);
> -
> /* identifier of table space for relation (0 means default for database) */
> Oid reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);
>

What's the story behind moving relfilenode to the front? Alignment
consideration? Seems odd to move the relfilenode before the oid. If there's an
alignment issue, can't you just swap it with reltablespace or such to resolve
it?

> From f6e8e0e7412198b02671e67d1859a7448fe83f38 Mon Sep 17 00:00:00 2001
> From: dilip kumar <dilipbalaut(at)localhost(dot)localdomain>
> Date: Wed, 29 Jun 2022 13:24:32 +0530
> Subject: [PATCH v4 4/4] Don't delay removing Tombstone file until next
> checkpoint
>
> Currently, we can not remove the unused relfilenode until the
> next checkpoint because if we remove them immediately then
> there is a risk of reusing the same relfilenode for two
> different relations during single checkpoint due to Oid
> wraparound.

Well, not quite "currently", because at this point we've fixed that in a prior
commit ;)

> Now as part of the previous patch set we have made relfilenode
> 56 bit wider and removed the risk of wraparound so now we don't
> need to wait till the next checkpoint for removing the unused
> relation file and we can clean them up on commit.

Hm. Wasn't there also some issue around crash-restarts benefiting from having
those files around until later? I think what I'm remembering is what is
referenced in this comment:

> - * For regular relations, we don't unlink the first segment file of the rel,
> - * but just truncate it to zero length, and record a request to unlink it after
> - * the next checkpoint. Additional segments can be unlinked immediately,
> - * however. Leaving the empty file in place prevents that relfilenumber
> - * from being reused. The scenario this protects us from is:
> - * 1. We delete a relation (and commit, and actually remove its file).
> - * 2. We create a new relation, which by chance gets the same relfilenumber as
> - * the just-deleted one (OIDs must've wrapped around for that to happen).
> - * 3. We crash before another checkpoint occurs.
> - * During replay, we would delete the file and then recreate it, which is fine
> - * if the contents of the file were repopulated by subsequent WAL entries.
> - * But if we didn't WAL-log insertions, but instead relied on fsyncing the
> - * file after populating it (as we do at wal_level=minimal), the contents of
> - * the file would be lost forever. By leaving the empty file until after the
> - * next checkpoint, we prevent reassignment of the relfilenumber until it's
> - * safe, because relfilenumber assignment skips over any existing file.

This isn't related to oid wraparound, just crashes. It's possible that the
XLogFlush() in XLogPutNextRelFileNumber() prevents such a scenario, but if so
it still ought to be explained here, I think.

> + * Note that now we can immediately unlink the first segment of the regular
> + * relation as well because the relfilenumber is 56 bits wide since PG 16. So
> + * we don't have to worry about relfilenumber getting reused for some unrelated
> + * relation file.

I'm doubtful it's a good idea to start dropping at the first segment. I'm
fairly certain that there's smgrexists() checks in some places, and they'll
now stop working, even if there are later segments that remained, e.g. because
of an error in the middle of removing later segments.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-07-02 06:42:35 Re: Handle infinite recursion in logical replication setup
Previous Message Nathan Bossart 2022-07-02 03:40:20 Re: Assert name/short_desc to prevent SHOW ALL segfault