Re: making relfilenodes 56 bits

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-11 16:19:31
Message-ID: CA+TgmobzZ99smwb09O-pJL2B+EyMzp-WUe=vaoCUFFE7Y3DmDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 11, 2022 at 7:39 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> > changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
> > CLEAR_BUFFERTAG(buf->tag);
> > ^~~~~~~~~~~~~~~~~~~~~~~~~
> > ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> > from macro 'CLEAR_BUFFERTAG'
> > (a).forkNum = InvalidForkNumber, \
> > ^ ~~~~~~~~~~~~~~~~~
> > 1 error generated.
>
> Hmm so now we are using an unsigned int field so IMHO we can make
> InvalidForkNumber to 255 instead of -1?

If we're going to do that I think we had better do it as a separate,
preparatory patch.

It also makes me wonder why we're using macros rather than static
inline functions in buf_internals.h. I wonder whether we could do
something like this, for example, and keep InvalidForkNumber as -1:

static inline ForkNumber
BufTagGetForkNum(BufferTag *tagPtr)
{
int8 ret;

StaticAssertStmt(MAX_FORKNUM <= INT8_MAX);
ret = (int8) ((tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS);
return (ForkNumber) ret;
}

Even if we don't use that particular trick, I think we've generally
been moving toward using static inline functions rather than macros,
because it provides better type-safety and the code is often easier to
read. Maybe we should also approach it that way here. Or even commit a
preparatory patch replacing the existing macros with inline functions.
Or maybe it's best to leave it alone, not sure.

It feels like some of the changes to buf_internals.h in 0002 could be
moved into 0001. If we're going to introduce a combined method to set
the relnumber and fork, I think we could do that in 0001 rather than
making 0001 introduce a macro to set just the relfilenumber and then
having 0002 change it around again.

BUFFERTAG_RELNUMBER_BITS feels like a lie. It's defined to be 24, but
based on the name you'd expect it to be 56.

> But we are already logging this if we are setting the relfilenumber
> which is out of the already logged range, am I missing something?
> Check this change.
> + relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
> + if (ShmemVariableCache->relnumbercount <= relnumbercount)
> + {
> + LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
> + ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
> + }
> + else
> + ShmemVariableCache->relnumbercount -= relnumbercount;

Oh, I guess I missed that.

> I had those changes in v7-0003, now I have merged with 0002. This has
> assert check while replaying the WAL for smgr create and smgr
> truncate, and while during normal path when allocating the new
> relfilenumber we are asserting for any existing file.

I think a test-and-elog might be better. Most users won't be running
assert-enabled builds, but this seems worth checking regardless.

> I have done some performance tests, with very small values I can see a
> lot of wait events for RelFileNumberGen but with bigger numbers like
> 256 or 512 it is not really bad. See results at the end of the
> mail[1]

It's a little hard to interpret these results because you don't say
how often you were checking the wait events, or how often the
operation took to complete. I suppose we can guess the relative time
scale from the number of Activity events: if there were 190
WalWriterMain events observed, then the time to complete the operation
is probably 190 times how often you were checking the wait events, but
was that every second or every half second or every tenth of a second?

> I have done these changes during GetNewRelFileNumber() this required
> to track the last logged record pointer as well but I think this looks
> clean. With this I can see some reduction in RelFileNumberGen wait
> event[1]

I find the code you wrote here a little bit magical. I believe it
depends heavily on choosing to issue the new WAL record when we've
exhausted exactly 50% of the available space. I suggest having two
constants, one of which is the number of relfilenumber values per WAL
record, and the other of which is the threshold for issuing a new WAL
record. Maybe something like RFN_VALUES_PER_XLOG and
RFN_NEW_XLOG_THRESHOLD, or something. And then work code that works
correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log
new RFNs until old allocation is completely exhausted) and
RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the
previous allocation). That way, if in the future someone decides to
change the constant values, they can do that and the code still works.

> I am not sure what is the best solution here, but I agree that most of
> the modern hardware will have bigger sector size than 512 so we can
> just change file size of 1024.

I went looking for previous discussion of this topic. Here's Heikki
doubting whether even 512 is too big:

http://postgr.es/m/f03d9166-ad12-2a3c-f605-c1873ee86ae4@iki.fi

Here's Thomas saying that he thinks it's probably mostly 4kB these
days, except when it isn't:

http://postgr.es/m/CAEepm=1e91zMk-vZszCOGDtKd=DhMLQjgENRSxcbSEhxuEPpfA@mail.gmail.com

Here's Tom with another idea how to reduce space usage:

http://postgr.es/m/7235.1566626302@sss.pgh.pa.us

It doesn't look to me like there's a consensus that some bigger number is safe.

> The current value of RELMAPPER_FILEMAGIC is 0x592717, I am not sure
> how this version ID is decide is this some random magic number or
> based on some logic?

Hmm, maybe we're not supposed to bump this value after all. I guess
maybe it's intended strictly as a magic number, rather than as a
version indicator. At least, we've never changed it up until now.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-07-11 16:25:44 Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT
Previous Message Andres Freund 2022-07-11 16:14:04 Re: automatically generating node support functions