Re: making relfilenodes 56 bits

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-06-29 22:12:46
Message-ID: CAEze2WgKaNhzckxGdjQ_V13E2nW6CiNyegHXWU+pAyGCnO6QXg@mail.gmail.com
Lists: pgsql-hackers

On Wed, 29 Jun 2022 at 14:41, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> wrote:
>
> On Tue, 28 Jun 2022 at 19:18, Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
>
> > I will be the first to admit that it is quite unlikely to be common
> > practise, but this workload increases the number of dbOid+spcOid
> > combinations to 100s (even while using only a single tablespace),
>
> Which should still fit nicely in 32bits then. Why does that present a
> problem to this idea?

It doesn't, or at least the bit space isn't the problem. I think it is
indeed quite unlikely that anyone will try to use as many tablespaces
as the Billion Tables Project did, which utilized 1000 tablespaces to
get around file system limitations [0].

The potential problem is where to store such a mapping efficiently,
especially considering that the mapping might (and likely will) change
across restarts, and whenever database churn (create + drop database)
happens in e.g. testing workloads.
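
To make that churn concern concrete, here is a toy simulation (plain
C, not proposed code; the starting OID value is only roughly right): a
test suite that repeatedly creates and drops a database never reuses a
dbOid, so an append-only mapping gains a slot on every cycle.

#include <stdio.h>

int
main(void)
{
    unsigned int next_dboid = 16384;    /* ~FirstNormalObjectId */
    int          mapping_entries = 0;

    /* A test suite that creates and drops a database 1000 times. */
    for (int cycle = 0; cycle < 1000; cycle++)
    {
        next_dboid++;          /* CREATE DATABASE assigns a fresh dbOid... */
        mapping_entries++;     /* ...so the mapping needs a fresh slot. */

        /*
         * DROP DATABASE releases the dbOid, but any offset that already
         * escaped to disk still refers to this slot, so reclaiming it
         * would change what that offset means.
         */
    }

    printf("mapping entries after churn: %d\n", mapping_entries);
    return 0;
}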

> The reason to mention this now is that it would give more space than
> 56bit limit being suggested here. I am not opposed to the current
> patch, just finding ways to remove some objections mentioned by
> others, if those became blockers.
>
> > which in my opinion requires some more thought than just handwaving it
> > into an smgr array and/or checkpoint records.
>
> The idea is that we would store the mapping as an array, with the
> value in the RelFileNode as the offset in the array. The array would
> be mostly static, so would cache nicely.

That part is not quite clear to me. A cluster may have anywhere from 3
to hundreds or thousands of entries in that mapping. Do you suggest
dynamically growing that array (presumably in shared memory, given
that the addressing is shared), or capping the number of entries with
a runtime parameter (similar to max_connections)?
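
For concreteness, a minimal sketch of the capped variant as I
understand it follows; every name in it (SpcDbMapEntry,
SpcDbMapGetOffset, max_spcdb_mappings) is made up for illustration,
and locking and shared memory allocation are omitted:

#include "postgres.h"

/*
 * Hypothetical mapping from a small integer (the value stored in the
 * RelFileNode) back to the (spcOid, dbOid) pair it stands for.
 */
typedef struct SpcDbMapEntry
{
    Oid     spcOid;         /* tablespace OID */
    Oid     dbOid;          /* database OID */
} SpcDbMapEntry;

static SpcDbMapEntry *SpcDbMap;         /* array in shared memory */
static int  SpcDbMapUsed = 0;           /* slots assigned so far */
int         max_spcdb_mappings = 1024;  /* hypothetical GUC */

/* Return the offset for (spcOid, dbOid), assigning a new slot if needed. */
static int
SpcDbMapGetOffset(Oid spcOid, Oid dbOid)
{
    for (int i = 0; i < SpcDbMapUsed; i++)
    {
        if (SpcDbMap[i].spcOid == spcOid && SpcDbMap[i].dbOid == dbOid)
            return i;
    }

    if (SpcDbMapUsed >= max_spcdb_mappings)
        elog(ERROR, "too many tablespace/database combinations");

    SpcDbMap[SpcDbMapUsed].spcOid = spcOid;
    SpcDbMap[SpcDbMapUsed].dbOid = dbOid;
    return SpcDbMapUsed++;
}

Either way, the interesting part is not the lookup but the lifetime of
a slot: nothing in such a structure says when an offset is safe to
reuse.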

> For convenience, I imagine that the mapping could be included in WAL
> in or near the checkpoint record, to ensure that the mapping was
> available in all backups.

Why would we need this mapping in backups, considering that it seems to
be transient state that is lost on restart? Won't we still use the full
dbOid and spcOid in anything we communicate or store on disk (file
names, WAL, pg_class rows, etc.), or did I misunderstand your
proposal?

Kind regards,

Matthias van de Meent

[0] https://www.pgcon.org/2013/schedule/attachments/283_Billion_Tables_Project-PgCon2013.pdf
