Re: making relfilenodes 56 bits

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-11 18:57:36
Message-ID: 20220711185736.zc3ht4n56ceqmh6d@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-07-07 13:26:29 -0400, Robert Haas wrote:
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

ISTM that we should redefine pg_class_tblspc_relfilenode_index to only cover
relfilenode - afaics there's no real connection to the tablespace
anymore. That'd a) reduce the size of the index and b) guarantee uniqueness
across tablespaces.
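
Roughly something like the following in src/include/catalog/pg_class.h (the
existing declaration is reproduced from memory, and whether the narrowed index
could also become a DECLARE_UNIQUE_INDEX depends on how mapped relations, which
store relfilenode 0, are handled):

/* today: keyed on (reltablespace, relfilenode) */
DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, ClassTblspcRelfilenodeIndexId, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));

/* sketch: keyed on relfilenode alone (presumably renamed accordingly); with
 * a 64-bit relfilenumber the opclass would need to change as well */
DECLARE_INDEX(pg_class_relfilenode_index, 3455, ClassRelfilenodeIndexId, on pg_class using btree(relfilenode oid_ops));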

I don't know where we could fit a sanity check that connects to all databases
and detects duplicates across all the pg_class instances. Perhaps pg_amcheck?

It may be worth changing RelidByRelfilenumber() / its infrastructure to not
use reltablespace anymore.
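
For illustration, the lookup cache in relfilenodemap.c is currently keyed as
shown below (structs reproduced from memory) and could shrink to just the
relfilenumber; the new names are made up:

/* today */
typedef struct
{
    Oid         reltablespace;
    Oid         relfilenode;
} RelfilenodeMapKey;

/* sketch: with globally unique relfilenumbers the tablespace isn't needed */
typedef struct
{
    RelFileNumber relfilenumber;    /* 64 bits wide under this patch */
} RelfilenumberMapKey;

typedef struct
{
    RelfilenumberMapKey key;        /* hash key - must be first */
    Oid         relid;              /* pg_class OID of the relation */
} RelfilenumberMapEntry;

RelidByRelfilenumber() itself would then take only the relfilenumber.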

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it.

I think that's a very good idea. My concern around doing an XLogFlush() is
that it could lead to a lot of tiny f[data]sync()s, because not much else
needs to be written out. But the scheme you describe would likely lead to the
XLogFlush() flushing plenty of other WAL records as well, addressing that.

> If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed in the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

Even if the record isn't yet flushed out by the time we need it to be, the
deferral means that there's a good chance more useful records can be flushed
out at the same time...
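
To make the bookkeeping concrete, here's a minimal sketch of the staggered
scheme (everything except VAR_RELNUMBER_PREFETCH, XLogRecPtr and XLogFlush()
is made up, and locking / checkpoint interaction is ignored):

#include "postgres.h"
#include "access/xlog.h"

/* hypothetical shared state for relfilenumber allocation */
typedef struct RelNumberAllocState
{
    uint64      nextRelFileNumber;  /* next value to hand out */
    uint64      loggedUpTo;         /* values below this are WAL-logged */
    XLogRecPtr  pendingLSN;         /* LSN of the not-yet-flushed record */
} RelNumberAllocState;

/* placeholder for however the patch WAL-logs a new upper bound */
extern XLogRecPtr LogNextRelFileNumber(uint64 new_upper_bound);

static uint64
GetNewRelFileNumberSketch(RelNumberAllocState *state)
{
    uint64      result = state->nextRelFileNumber++;

    /*
     * Halfway through the current allocation, log a record reserving the
     * next VAR_RELNUMBER_PREFETCH values and remember its LSN, but don't
     * flush it yet - with luck some other WAL flush gets there first.
     */
    if (state->loggedUpTo - state->nextRelFileNumber ==
        VAR_RELNUMBER_PREFETCH / 2)
        state->pendingLSN =
            LogNextRelFileNumber(state->loggedUpTo + VAR_RELNUMBER_PREFETCH);

    /*
     * Only once the previously logged allocation is exhausted do we have to
     * force the reservation record to disk ourselves.
     */
    if (state->nextRelFileNumber >= state->loggedUpTo)
    {
        XLogFlush(state->pendingLSN);
        state->loggedUpTo += VAR_RELNUMBER_PREFETCH;
    }

    return result;
}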

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
>
> #define MAX_MAPPINGS 62 /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one.

+1

Perhaps MAX_MAPPINGS should be at least partially computed instead of doing
the math in a comment? sizeof(RelMapping) could directly be used, and we could
define SIZEOF_RELMAPFILE_START with a StaticAssert() enforcing it to be equal
to offsetof(RelMapFile, mappings), if we move crc & pad to *before* mappings -
afaics that should be entirely doable.
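
Roughly along these lines (a sketch only - names, field order and the 512 byte
assumption are illustrative, and RelFileNumber is assumed to be 64 bit per this
patch):

#include "postgres.h"
#include "port/pg_crc32c.h"

typedef uint64 RelFileNumber;       /* per the patch, repeated for the sketch */

#define RELMAPFILE_SIZE         512 /* assumed atomic write unit */
#define SIZEOF_RELMAPFILE_START 16  /* magic + num_mappings + crc + pad */

typedef struct RelMapping
{
    Oid         mapoid;             /* OID of a mapped relation */
    RelFileNumber mapfilenumber;    /* its relfilenumber */
} RelMapping;

/* computed from the layout instead of doing the math in a comment */
#define MAX_MAPPINGS \
    ((RELMAPFILE_SIZE - SIZEOF_RELMAPFILE_START) / sizeof(RelMapping))

typedef struct RelMapFile
{
    int32       magic;
    int32       num_mappings;
    pg_crc32c   crc;                /* crc & pad moved before mappings */
    int32       pad;
    RelMapping  mappings[MAX_MAPPINGS];
} RelMapFile;

/* catch any drift between the constants and the actual layout */
StaticAssertDecl(offsetof(RelMapFile, mappings) == SIZEOF_RELMAPFILE_START,
                 "SIZEOF_RELMAPFILE_START is out of date");
StaticAssertDecl(sizeof(RelMapFile) <= RELMAPFILE_SIZE,
                 "RelMapFile larger than the assumed atomic write unit");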

> If we do nothing fancy here, the maximum number of mappings will have to be
> reduced from 62 to 31, which is a problem because global/pg_filenode.map
> currently has 48 entries. We could try to arrange to squeeze padding out of
> the RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're using
> already, never mind leaving room for future growth.

Ugh.

> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something.

I'm not so sure that's a good idea - while the hardware sector size likely
isn't 512 on much storage anymore, it's still the size that most storage
protocols use. Which then means you'd not just be relying on storage
atomicity, but also on nothing in the filesystem <-> block layer <-> driver
stack somehow splitting a single larger write into two.

And if you use a filesystem with a smaller filesystem block size, there might
not even be a choice: the write could unavoidably end up split into two.
E.g. XFS still supports 512 byte blocks (although I think block sizes < 1024
are being deprecated).

Maybe the easiest fix here would be to replace the file atomically. Then we
wouldn't need this <= 512 byte stuff at all. Relmap updates happen rarely
enough that I don't think the overhead of creating a separate file, fsyncing
it, renaming it into place, and fsyncing again would be a problem?
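
Something like this hypothetical helper (names made up; real code would need
proper error handling and to go through the existing relmapper write paths):

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "storage/fd.h"

/*
 * Write the new contents to a temporary file, fsync it, and durably rename
 * it over the old file.  The file then no longer has to fit into a single
 * 512 byte sector write.
 */
static void
write_relmap_file_atomically(const char *path, const void *data, size_t len)
{
    char        tmppath[MAXPGPATH];
    int         fd;

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fd = OpenTransientFile(tmppath, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
    if (fd < 0)
        elog(ERROR, "could not create file \"%s\": %m", tmppath);

    if (write(fd, data, len) != (ssize_t) len)
        elog(ERROR, "could not write file \"%s\": %m", tmppath);

    if (pg_fsync(fd) != 0)
        elog(ERROR, "could not fsync file \"%s\": %m", tmppath);

    if (CloseTransientFile(fd) != 0)
        elog(ERROR, "could not close file \"%s\": %m", tmppath);

    /* rename into place; durable_rename() fsyncs file and directory */
    durable_rename(tmppath, path, ERROR);
}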

Greetings,

Andres Freund
