Re: making relfilenodes 56 bits

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making relfilenodes 56 bits
Date: 2022-07-11 11:39:11
Message-ID: CAFiTN-t9j=z423_UcvrdwD0Hry5GBuSXJEU72sftRec9dM-1ZA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 7, 2022 at 10:56 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

I have accepted all the suggestions; please find my inline replies where we
need more thought.

> buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
> CLEAR_BUFFERTAG(buf->tag);
> ^~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> from macro 'CLEAR_BUFFERTAG'
> (a).forkNum = InvalidForkNumber, \
> ^ ~~~~~~~~~~~~~~~~~
> 1 error generated.

Hmm, since we are now using an unsigned int bit-field, IMHO we can make
InvalidForkNumber 255 instead of -1?

> > I think it is in line with oidCount, what do you think?
>
> Oh it definitely is, and maybe it's OK the way you have it. But the
> OID stuff has wraparound to worry about, and this doesn't; and this
> has the SetNextRelFileNumber and that doesn't; so it is not
> necessarily the case that the design which is best for that case is
> also best for this case.

Yeah, right, but now with the latest changes for piggybacking the
XLogFlush() I think it is cleaner to have the count.

> I believe that the persistence model for SetNextRelFileNumber needs
> more thought. Right now I believe it's relying on the fact that, after
> we try to restore the dump, we'll try to perform a clean shutdown of
> the server before doing anything important, and that will persist the
> final value, whatever it ends up being. However, there's no comment
> explaining that theory of operation, and it seems pretty fragile
> anyway. What if things don't go as planned? Suppose the power goes out
> halfway through restoring the dump, and the user for some reason then
> gives up on running pg_upgrade and just tries to do random things with
> that server? Then I think there will be trouble, because nothing has
> updated the nextrelfilenumber value and yet there are potentially new
> files on disk. Maybe that's a stretch since I think other things might
> also break if you do that, but I'm also not sure that's the only
> scenario to worry about, especially if you factor in the possibility
> of future code changes, like changes to the timing of when we shut
> down and restart the server during pg_upgrade, or other uses of
> binary-upgrade mode, or whatever. I don't know. Perhaps it's not
> actually broken but I'm inclined to think it should be logging its
> changes.

But we are already logging this if the relfilenumber being set is
outside the already-logged range; am I missing something?
See this change:
+ relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+ if (ShmemVariableCache->relnumbercount <= relnumbercount)
+ {
+     LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
+     ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+ }
+ else
+     ShmemVariableCache->relnumbercount -= relnumbercount;

> A related thought is that I don't think this patch has as many
> cross-checks as it could have. For instance, suppose that when we
> replay a WAL record that creates relation storage, we cross-check that
> the value is less than the counter. I think you have a check in there
> someplace that will error out if there is an actual collision --
> although I can't find it at the moment, and possibly we want to add
> some comments there even if it's in existing code -- but this kind of
> thing would detect bugs that could lead to collisions even if no
> collision actually occurs, e.g. because a duplicate relfilenumber is
> used but in a different database or tablespace. It might be worth
> spending some time thinking about other possible cross-checks too.
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

I had those changes in v7-0003; now I have merged them into 0002. This
adds assert checks while replaying the WAL for smgr create and smgr
truncate, and in the normal path, when allocating a new relfilenumber,
we assert that no file with that number already exists.

> One thing that isn't great about this whole scheme is that it can lead
> to lock pile-ups. Once somebody is waiting for an
> XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
> that tries to get a new relfilenumber is going to block waiting for
> RelFileNumberGenLock. I wonder whether this effect is observable in
> practice: suppose we just create relations in a tight loop from inside
> a stored procedure, and do that simultaneously in multiple backends?
> What does the wait event distribution look like? Can we observe a lot
> of RelFileNumberGenLock events or not really? I guess if we reduce
> VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
> how small a value is needed?

I have done some performance tests; with very small values I can see a
lot of wait events for RelFileNumberGen, but with bigger values like
256 or 512 it is not really bad. See the results at the end of the
mail [1].

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it. If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed into the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

I have made these changes in GetNewRelFileNumber(). This required
tracking the last logged record pointer as well, but I think it looks
clean. With this I can see some reduction in RelFileNumberGen wait
events [1].

> In theory I suppose there's another way we could solve this problem:
> keep using the same relfilenumber, and if the scenario described here
> occurs, just reuse the old file. The reason why we can't do that today
> is because we could be running with wal_level=minimal and replace a
> relation with one whose contents aren't logged. If WAL replay then
> replays the drop, we're in trouble. But if the only time we reuse a
> relfilenumber for new relation storage is when relations are moved
> around, then I think that scenario can't happen. However, I think
> assigning a new relfilenumber is probably better, because it gets us
> closer to a world in which relfilenumbers are never reused at all. It
> doesn't get us all the way there because of createdb() and movedb(),
> but it gets us closer and I prefer that.

I agree with you.

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
>
> #define MAX_MAPPINGS 62 /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one. If we do nothing fancy here, the
> maximum number of mappings will have to be reduced from 62 to 31,
> which is a problem because global/pg_filenode.map currently has 48
> entries. We could try to arrange to squeeze padding out of the
> RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're
> using already, never mind leaving room for future growth.
>
> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something. If that idea is judged
> unsafe, I can think of two other possible approaches offhand. One is
> that we could move away from the idea of storing the OIDs in the file
> along with the RelFileNodes, and instead store the offset for a given
> RelFileNode at a fixed offset in the file. That would require either
> hard-wiring offset tables into the code someplace, or generating them
> as part of the build process, with separate tables for shared and
> database-local relation map files. The other is that we could have
> multiple 512-byte sectors and try to arrange for each relation to be
> in the same sector with the indexes of that relation, since the
> comments in relmapper.c say this:
>
> * aborts. An important factor here is that the indexes and toast table of
> * a mapped catalog must also be mapped, so that the rewrites/relocations of
> * all these files commit in a single map file update rather than being tied
> * to transaction commit.
>
> This suggests that atomicity is required across a table and its
> indexes, but not that it's needed across arbitrary sets of entries in
> the file.
>
> Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

I am not sure what the best solution is here, but I agree that most
modern hardware will have a sector size bigger than 512 bytes, so we
can just change the file size to 1024.

The current value of RELMAPPER_FILEMAGIC is 0x592717. I am not sure how
this version ID is decided; is it some random magic number, or is it
based on some logic?

>
> + uint32 relNumber_low; /* relfilenumber 32 lower bits */
> + uint32 relNumber_hi:24; /* relfilenumber 24 high bits */
> + uint32 forkNum:8; /* fork number */
>
> I still think we'd be better off with something like uint32
> relForkDetails[2]. The bitfields would be nice if they meant that we
> didn't have to do bit-shifting and masking operations ourselves, but
> with the field split this way, we do anyway. So what's the point in
> mixing the approaches?

Actually, with the bit-fields we were able to access the forkNum
directly, but I also think changing to relForkDetails[2] is cleaner, so
I have done that. As part of the related changes in 0001 I have removed
the direct access to the forkNum.

[1] Wait event details

Procedure:
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
    relname varchar;
    pid int;
    i int;
BEGIN
    SELECT pg_backend_pid() INTO pid;
    relname := 'test_' || pid;
    FOR i IN 1..count LOOP
        EXECUTE format('CREATE TABLE %s(a int)', relname);
        EXECUTE format('DROP TABLE %s', relname);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Target test: Executed "select create_table(100);" query from pgbench
with 32 concurrent backends.

VAR_RELNUMBER_PREFETCH = 8

905 LWLock | LockManager
346 LWLock | RelFileNumberGen
190 Activity | WalWriterMain

VAR_RELNUMBER_PREFETCH=128
1187 LWLock | LockManager
247 LWLock | RelFileNumberGen
139 Activity | CheckpointerMain

VAR_RELNUMBER_PREFETCH=256

1029 LWLock | LockManager
158 LWLock | BufferContent
134 Activity | CheckpointerMain
134 Activity | AutoVacuumMain
133 Activity | BgWriterMain
132 Activity | WalWriterMain
130 Activity | LogicalLauncherMain
123 LWLock | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=512

1174 LWLock | LockManager
136 Activity | CheckpointerMain
136 Activity | BgWriterMain
136 Activity | AutoVacuumMain
134 Activity | WalWriterMain
134 Activity | LogicalLauncherMain
99 LWLock | BufferContent
35 LWLock | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=2048
1070 LWLock | LockManager
160 LWLock | BufferContent
156 Activity | CheckpointerMain
155 Activity | BgWriterMain
154 Activity | AutoVacuumMain
153 Activity | WalWriterMain
149 Activity | LogicalLauncherMain
31 LWLock | RelFileNumberGen
28 Timeout | VacuumDelay

VAR_RELNUMBER_PREFETCH=4096
Note: no wait event for RelFileNumberGen at value 4096

New patch with piggybacking XLogFlush()

VAR_RELNUMBER_PREFETCH = 8

1105 LWLock | LockManager
143 LWLock | BufferContent
140 Activity | CheckpointerMain
140 Activity | BgWriterMain
139 Activity | WalWriterMain
138 Activity | AutoVacuumMain
137 Activity | LogicalLauncherMain
115 LWLock | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 256
1130 LWLock | LockManager
141 Activity | CheckpointerMain
139 Activity | BgWriterMain
137 Activity | AutoVacuumMain
136 Activity | LogicalLauncherMain
135 Activity | WalWriterMain
69 LWLock | BufferContent
31 LWLock | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 1024
Note: no wait event for RelFileNumberGen at value 1024

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v8-0001-Preliminary-refactoring-for-supporting-larger-rel.patch text/x-patch 22.6 KB
v8-0002-Widen-relfilenumber-from-32-bits-to-56-bits.patch text/x-patch 88.7 KB
