Re: refactoring relation extension and BufferAlloc(), faster COPY

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY
Date: 2023-03-29 03:47:54
Message-ID: 20230329034754.elzpfwacqmapv6pj@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> Below is my review of a slightly older version than you just posted --
> much of it you may have already addressed.

Far from all is already - thanks for the review!

> From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres(at)anarazel(dot)de>
> Date: Mon, 20 Mar 2023 21:57:40 -0700
> Subject: [PATCH v5 01/15] createdb-using-wal-fixes
>
> This could use a more detailed commit message -- I don't really get what
> it is doing

It's a fix for a bug that I encountered while hacking on this, it since has
been committed.

commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: 2023-03-20 21:57:40 -0700

Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG

RelationCopyStorageUsingBuffer() did not free the strategies used to access
the source / target relation. They memory was released at the end of the
transaction, but when using a template database with a lot of relations, the
temporary leak can become big prohibitively big.

RelationCopyStorageUsingBuffer() acquired the buffer for the target relation
with RBM_NORMAL, therefore requiring a read of a block guaranteed to be
zero. Use RBM_ZERO_AND_LOCK instead.

Reviewed-by: Robert Haas <robertmhaas(at)gmail(dot)com>
Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de
Backpatch: 15-, where STRATEGY WAL_LOG was introduced

> From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres(at)anarazel(dot)de>
> Date: Wed, 1 Jul 2020 19:06:45 -0700
> Subject: [PATCH v5 02/15] Add some error checking around pinning
>
> ---
> src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
> src/include/storage/bufmgr.h | 1 +
> 2 files changed, 29 insertions(+), 12 deletions(-)
>

> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 95212a3941..fa20fab5a2 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
> LW_EXCLUSIVE);
> }
>
> +void
> +BufferCheckOneLocalPin(Buffer buffer)
> +{
> + if (BufferIsLocal(buffer))
> + {
> + /* There should be exactly one pin */
> + if (LocalRefCount[-buffer - 1] != 1)
> + elog(ERROR, "incorrect local pin count: %d",
> + LocalRefCount[-buffer - 1]);
> + }
> + else
> + {
> + /* There should be exactly one local pin */
> + if (GetPrivateRefCount(buffer) != 1)
>
> I'd rather this be an else if (was already like this, but, no reason not
> to change it now)

I don't like that much - it'd break the symmetry between local / non-local.

> +/*
> + * Zero a region of the file.
> + *
> + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
> + * appropriate error.
> + */
> +int
> +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> + int returnCode;
> + ssize_t written;
> +
> + Assert(FileIsValid(file));
> + returnCode = FileAccess(file);
> + if (returnCode < 0)
> + return returnCode;
> +
> + pgstat_report_wait_start(wait_event_info);
> + written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
> + pgstat_report_wait_end();
> +
> + if (written < 0)
> + return -1;
> + else if (written != amount)
>
> this doesn't need to be an else if

You mean it could be a "bare" if instead? I don't really think that's clearer.

> + {
> + /* if errno is unset, assume problem is no disk space */
> + if (errno == 0)
> + errno = ENOSPC;
> + return -1;
> + }
>
> +int
> +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +#ifdef HAVE_POSIX_FALLOCATE
> + int returnCode;
> +
> + Assert(FileIsValid(file));
> + returnCode = FileAccess(file);
> + if (returnCode < 0)
> + return returnCode;
> +
> + pgstat_report_wait_start(wait_event_info);
> + returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> + pgstat_report_wait_end();
> +
> + if (returnCode == 0)
> + return 0;
> +
> + /* for compatibility with %m printing etc */
> + errno = returnCode;
> +
> + /*
> + * Return in cases of a "real" failure, if fallocate is not supported,
> + * fall through to the FileZero() backed implementation.
> + */
> + if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> + return returnCode;
>
> I'm pretty sure you can just delete the below if statement
>
> + if (returnCode == 0 ||
> + (returnCode != EINVAL && returnCode != EINVAL))
> + return returnCode;

Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?

>
> +/*
> + * mdzeroextend() -- Add ew zeroed out blocks to the specified relation.
>
> not sure what ew is

A hurried new :)

> + *
> + * Similar to mdrextend(), except the relation can be extended by
>
> mdrextend->mdextend

> + * multiple blocks at once, and that the added blocks will be
> filled with
>
> I would lose the comma and just say "and the added blocks will be filled..."

Done.

> +void
> +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> + BlockNumber blocknum, int nblocks, bool skipFsync)
>
> So, I think there are a few too many local variables in here, and it
> actually makes it more confusing.
> Assuming you would like to keep the input parameters blocknum and
> nblocks unmodified for debugging/other reasons, here is a suggested
> refactor of this function

I'm mostly adopting this.

> Also, I think you can combine the two error cases (I don't know if the
> user cares what you were trying to extend the file with).

Hm. I do find it a somewhat useful distinction for figuring out problems - we
haven't used posix_fallocate for files so far, it seems plausible we'd hit
some portability issues. We could make it an errdetail(), I guess?

> void
> mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> BlockNumber blocknum, int nblocks, bool skipFsync)
> {
> MdfdVec *v;
> BlockNumber curblocknum = blocknum;
> int remblocks = nblocks;
> Assert(nblocks > 0);
>
> /* This assert is too expensive to have on normally ... */
> #ifdef CHECK_WRITE_VS_EXTEND
> Assert(blocknum >= mdnblocks(reln, forknum));
> #endif
>
> /*
> * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
> * more --- we mustn't create a block whose number actually is
> * InvalidBlockNumber or larger.
> */
> if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
> ereport(ERROR,
> (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> errmsg("cannot extend file \"%s\" beyond %u blocks",
> relpath(reln->smgr_rlocator, forknum),
> InvalidBlockNumber)));
>
> while (remblocks > 0)
> {
> int segstartblock = curblocknum % ((BlockNumber)
> RELSEG_SIZE);

Hm - this shouldn't be an int - that's my fault, not yours...

> From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres(at)anarazel(dot)de>
> Date: Wed, 26 Oct 2022 12:05:07 -0700
> Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index fa20fab5a2..6f50dbd212 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
> }
>
> void
> -BufferCheckOneLocalPin(Buffer buffer)
> +BufferCheckWePinOnce(Buffer buffer)
>
> This name is weird. Who is we?

The current backend. I.e. the function checks the current backend pins the
buffer exactly once, rather that *any* backend pins it once.

I now see that BufferIsPinned() is named, IMO, misleadingly, more generally,
even though it also just applies to pins by the current backend.

> diff --git a/src/backend/storage/buffer/localbuf.c
> b/src/backend/storage/buffer/localbuf.c
> index 5325ddb663..798c5b93a8 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> +bool
> +PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
> +{
> + uint32 buf_state;
> + Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
> + int bufid = -(buffer + 1);
>
> You do
> int buffid = -buffer - 1;
> in UnpinLocalBuffer()
> They should be consistent.
>
> int bufid = -(buffer + 1);
>
> I think this version is better:
>
> int buffid = -buffer - 1;
>
> Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
> undefined behavior while the -buffer - 1 version doesn't.

You are right! Not sure what I was doing there...

Ah - turns out there's pre-existing code in localbuf.c that do it that way :(
See at least MarkLocalBufferDirty().

We really need to wrap this in something, rather than open coding it all over
bufmgr.c/localbuf.c. I really dislike this indexing :(.

> From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres(at)anarazel(dot)de>
> Date: Fri, 17 Feb 2023 18:26:34 -0800
> Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately
>
> Previously we held buffer locks for two buffer mapping partitions at the same
> time to change the identity of buffers. Particularly for extending relations
> needing to hold the extension lock while acquiring a victim buffer is
> painful. By separating out the victim buffer acquisition, future commits will
> be able to change relation extensions to scale better.
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 3d0683593f..ea423ae484 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char
> relpersistence, ForkNumber forkNum,
>
> /*
> * Buffer contents are currently invalid. Try to obtain the right to
> * start I/O. If StartBufferIO returns false, then someone else managed
> * to read it before we did, so there's nothing left for BufferAlloc() to
> * do.
> */
> - if (StartBufferIO(buf, true))
> + if (StartBufferIO(victim_buf_hdr, true))
> *foundPtr = false;
> else
> *foundPtr = true;
>
> I know it was already like this, but since you edited the line already,
> can we just make this this now?
>
> *foundPtr = !StartBufferIO(victim_buf_hdr, true);

Hm, I do think it's easier to review if largely unchanged code is just moved
around, rather also rewritten. So I'm hesitant to do that here.

> @@ -1595,6 +1413,237 @@ retry:
> StrategyFreeBuffer(buf);
> }
>
> +/*
> + * Helper routine for GetVictimBuffer()
> + *
> + * Needs to be called on a buffer with a valid tag, pinned, but without the
> + * buffer header spinlock held.
> + *
> + * Returns true if the buffer can be reused, in which case the buffer is only
> + * pinned by this backend and marked as invalid, false otherwise.
> + */
> +static bool
> +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> +{
> + /*
> + * Clear out the buffer's tag and flags and usagecount. This is not
> + * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> + * doing anything with the buffer. But currently it's beneficial as the
> + * pre-check for several linear scans of shared buffers just checks the
> + * tag.
>
> I don't really understand the above comment -- mainly the last sentence.

To start with, it's s/checks/check/

"linear scans" is a reference to functions like DropRelationBuffers(), which
iterate over all buffers, and just check the tag for a match. If we leave the
tag around, it'll still work, as InvalidateBuffer() etc will figure out that
the buffer is invalid. But of course that's slower then just skipping the
buffer "early on".

> +static Buffer
> +GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
> +{
> + BufferDesc *buf_hdr;
> + Buffer buf;
> + uint32 buf_state;
> + bool from_ring;
> +
> + /*
> + * Ensure, while the spinlock's not yet held, that there's a free refcount
> + * entry.
> + */
> + ReservePrivateRefCountEntry();
> + ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> +
> + /* we return here if a prospective victim buffer gets used concurrently */
> +again:
>
> Why use goto instead of a loop here (again is the goto label)?

I find it way more readable this way. I'd use a loop if it were the common
case to loop, but it's the rare case, and for that I find the goto more
readable.

> @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool
> clear_dirty, uint32 set_flag_bits)
> {
> uint32 buf_state;
>
> I noticed that the comment above TermianteBufferIO() says
> * TerminateBufferIO: release a buffer we were doing I/O on
> * (Assumptions)
> * My process is executing IO for the buffer
>
> Can we still say this is an assumption? What about when it is being
> cleaned up after being called from AbortBufferIO()

That hasn't really changed - it was already called by AbortBufferIO().

I think it's still correct, too. We must have marked the IO as being in
progress to get there.

> diff --git a/src/backend/utils/resowner/resowner.c
> b/src/backend/utils/resowner/resowner.c
> index 19b6241e45..fccc59b39d 100644
> --- a/src/backend/utils/resowner/resowner.c
> +++ b/src/backend/utils/resowner/resowner.c
> @@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
>
> /* We have built-in support for remembering: */
> ResourceArray bufferarr; /* owned buffers */
> + ResourceArray bufferioarr; /* in-progress buffer IO */
> ResourceArray catrefarr; /* catcache references */
> ResourceArray catlistrefarr; /* catcache-list pins */
> ResourceArray relrefarr; /* relcache references */
> @@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
>
> Maybe worth mentioning in-progress buffer IO in resowner README? I know
> it doesn't claim to be exhaustive, so, up to you.

Hm. Given the few types of resources mentioned in the README, I don't think
it's worth doing so.

> Also, I realize that existing code in this file has the extraneous
> parantheses, but maybe it isn't worth staying consistent with that?
> as in: &(owner->bufferioarr)

I personally don't find it worth being consistent with that, but if you /
others think it is, I'd be ok with adapting to that.

> From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres(at)anarazel(dot)de>
> Date: Wed, 1 Mar 2023 13:24:19 -0800
> Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
> ExtendBufferedRel{By,To,}
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 3c95b87bca..4e07a5bc48 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> +/*
> + * Extend relation by multiple blocks.
> + *
> + * Tries to extend the relation by extend_by blocks. Depending on the
> + * availability of resources the relation may end up being extended by a
> + * smaller number of pages (unless an error is thrown, always by at least one
> + * page). *extended_by is updated to the number of pages the relation has been
> + * extended to.
> + *
> + * buffers needs to be an array that is at least extend_by long. Upon
> + * completion, the first extend_by array elements will point to a pinned
> + * buffer.
> + *
> + * If EB_LOCK_FIRST is part of flags, the first returned buffer is
> + * locked. This is useful for callers that want a buffer that is guaranteed to
> + * be empty.
>
> This should document what the returned BlockNumber is.

Ok.

> Also, instead of having extend_by and extended_by, how about just having
> one which is set by the caller to the desired number to extend by and
> then overwritten in this function to the value it successfully extended
> by.

I had it that way at first - but I think it turned out to be more confusing.

> It would be nice if the function returned the number it extended by
> instead of the BlockNumber.

It's not actually free to get the block number from a buffer (it causes more
sharing of the BufferDesc cacheline, which then makes modifications of the
cacheline more expensive). We should work on removing all those
BufferGetBlockNumber(). So I don't want to introduce a new function that
requires using BufferGetBlockNumber().

So I don't think this would be an improvement.

> + Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
>
> Can we turn these into !=
>
> Assert((eb.rel != NULL) != (eb.smgr != NULL));
>
> since it is easier to understand.

Done.

> + * Extend the relation so it is at least extend_to blocks large, read buffer
>
> Use of "read buffer" here is confusing. We only read the block if, after
> we try extending the relation, someone else already did so and we have
> to read the block they extended in, right?

That's one case, yes. I think there's also some unfortunate other case that
I'd like to get rid of. See my archeology at
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de

> + uint32 num_pages = lengthof(buffers);
> + BlockNumber first_block;
> +
> + if ((uint64) current_size + num_pages > extend_to)
> + num_pages = extend_to - current_size;
> +
> + first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
> + num_pages, extend_to,
> + buffers, &extended_by);
> +
> + current_size = first_block + extended_by;
> + Assert(current_size <= extend_to);
> + Assert(num_pages != 0 || current_size >= extend_to);
> +
> + for (int i = 0; i < extended_by; i++)
> + {
> + if (first_block + i != extend_to - 1)
>
> Is there a way we could avoid pinning these other buffers to begin with
> (e.g. passing a parameter to ExtendBufferedRelCommon())

We can't avoid pinning them. We could make ExtendBufferedRelCommon() release
them though - but I'm not sure that'd be an improvement. I actually had a
flag for that temporarily, but

> + if (buffer == InvalidBuffer)
> + {
> + bool hit;
> +
> + Assert(extended_by == 0);
> + buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
> + fork, extend_to - 1, mode, strategy,
> + &hit);
> + }
> +
> + return buffer;
> +}
>
> Do we use compound literals? Here, this could be:
>
> buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
> fork, extend_to - 1, mode, strategy,
> &(bool) {0});
>
> To eliminate the extraneous hit variable.

We do use compound literals in a few places. However, I don't think it's a
good idea to pass a pointer to a temporary. At least I need to look up the
lifetime rules for those every time. And this isn't a huge win, so I wouldn't
go for it here.

> /*
> * ReadBuffer_common -- common logic for all ReadBuffer variants
> @@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char
> relpersistence, ForkNumber forkNum,
> bool found;
> IOContext io_context;
> IOObject io_object;
> - bool isExtend;
> bool isLocalBuf = SmgrIsTemp(smgr);
>
> *hit = false;
>
> + /*
> + * Backward compatibility path, most code should use
> + * ExtendRelationBuffered() instead, as acquiring the extension lock
> + * inside ExtendRelationBuffered() scales a lot better.
>
> Think these are old function names in the comment

Indeed.

> +static BlockNumber
> +ExtendBufferedRelShared(ExtendBufferedWhat eb,
> + ForkNumber fork,
> + BufferAccessStrategy strategy,
> + uint32 flags,
> + uint32 extend_by,
> + BlockNumber extend_upto,
> + Buffer *buffers,
> + uint32 *extended_by)
> +{
> + BlockNumber first_block;
> + IOContext io_context = IOContextForStrategy(strategy);
> +
> + LimitAdditionalPins(&extend_by);
> +
> + /*
> + * Acquire victim buffers for extension without holding extension lock.
> + * Writing out victim buffers is the most expensive part of extending the
> + * relation, particularly when doing so requires WAL flushes. Zeroing out
> + * the buffers is also quite expensive, so do that before holding the
> + * extension lock as well.
> + *
> + * These pages are pinned by us and not valid. While we hold the pin they
> + * can't be acquired as victim buffers by another backend.
> + */
> + for (uint32 i = 0; i < extend_by; i++)
> + {
> + Block buf_block;
> +
> + buffers[i] = GetVictimBuffer(strategy, io_context);
> + buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
> +
> + /* new buffers are zero-filled */
> + MemSet((char *) buf_block, 0, BLCKSZ);
> + }
> +
> + /*
> + * Lock relation against concurrent extensions, unless requested not to.
> + *
> + * We use the same extension lock for all forks. That's unnecessarily
> + * restrictive, but currently extensions for forks don't happen often
> + * enough to make it worth locking more granularly.
> + *
> + * Note that another backend might have extended the relation by the time
> + * we get the lock.
> + */
> + if (!(flags & EB_SKIP_EXTENSION_LOCK))
> + {
> + LockRelationForExtension(eb.rel, ExclusiveLock);
> + eb.smgr = RelationGetSmgr(eb.rel);
> + }
> +
> + /*
> + * If requested, invalidate size cache, so that smgrnblocks asks the
> + * kernel.
> + */
> + if (flags & EB_CLEAR_SIZE_CACHE)
> + eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
>
> I don't see this in master, is it new?

Not really - it's just elsewhere. See vm_extend() and fsm_extend(). I could
move this part back into "Convert a few places to ExtendBufferedRelTo", but I
doin't think that'd be better.

> + * Flags influencing the behaviour of ExtendBufferedRel*
> + */
> +typedef enum ExtendBufferedFlags
> +{
> + /*
> + * Don't acquire extension lock. This is safe only if the relation isn't
> + * shared, an access exclusive lock is held or if this is the startup
> + * process.
> + */
> + EB_SKIP_EXTENSION_LOCK = (1 << 0),
> +
> + /* Is this extension part of recovery? */
> + EB_PERFORMING_RECOVERY = (1 << 1),
> +
> + /*
> + * Should the fork be created if it does not currently exist? This likely
> + * only ever makes sense for relation forks.
> + */
> + EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> +
> + /* Should the first (possibly only) return buffer be returned locked? */
> + EB_LOCK_FIRST = (1 << 3),
> +
> + /* Should the smgr size cache be cleared? */
> + EB_CLEAR_SIZE_CACHE = (1 << 4),
> +
> + /* internal flags follow */
>
> I don't understand what this comment means ("internal flags follow")

Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.

> + */
> +static int
> +heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
> Size saveFreeSpace)
> +{
> + size_t page_avail;
> + int npages = 0;
> +
> + page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> + npages++;
>
> can this not just be this:
>
> size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> int npages = 1;

Yes.

>
> From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres(at)anarazel(dot)de>
> Date: Wed, 26 Oct 2022 14:14:11 -0700
> Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()

> ---
> src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
> 1 file changed, 146 insertions(+), 139 deletions(-)
>
> diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
> index 65886839e7..48cfcff975 100644
> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c
> @@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
>
> so in RelationGetBufferForTuple() up above where your changes start,
> there is this code
>
>
> /*
> * We first try to put the tuple on the same page we last inserted a tuple
> * on, as cached in the BulkInsertState or relcache entry. If that
> * doesn't work, we ask the Free Space Map to locate a suitable page.
> * Since the FSM's info might be out of date, we have to be prepared to
> * loop around and retry multiple times. (To insure this isn't an infinite
> * loop, we must update the FSM with the correct amount of free space on
> * each page that proves not to be suitable.) If the FSM has no record of
> * a page with enough free space, we give up and extend the relation.
> *
> * When use_fsm is false, we either put the tuple onto the existing target
> * page or extend the relation.
> */
> if (bistate && bistate->current_buf != InvalidBuffer)
> {
> targetBlock = BufferGetBlockNumber(bistate->current_buf);
> }
> else
> targetBlock = RelationGetTargetBlock(relation);
>
> if (targetBlock == InvalidBlockNumber && use_fsm)
> {
> /*
> * We have no cached target page, so ask the FSM for an initial
> * target.
> */
> targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
> }
>
> And, I was thinking how, ReadBufferBI() only has one caller now
> (RelationGetBufferForTuple()) and, this caller basically already has
> checked for the case in the inside of ReadBufferBI() (the code I pasted
> above)
>
> /* If we have the desired block already pinned, re-pin and return it */
> if (bistate->current_buf != InvalidBuffer)
> {
> if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
> {
> /*
> * Currently the LOCK variants are only used for extending
> * relation, which should never reach this branch.
> */
> Assert(mode != RBM_ZERO_AND_LOCK &&
> mode != RBM_ZERO_AND_CLEANUP_LOCK);
>
> IncrBufferRefCount(bistate->current_buf);
> return bistate->current_buf;
> }
> /* ... else drop the old buffer */
>
> So, I was thinking maybe there is some way to inline the logic for
> ReadBufferBI(), because I think it would feel more streamlined to me.

I don't really see how - I'd welcome suggestions?

> @@ -558,18 +477,46 @@ loop:
> ReleaseBuffer(buffer);
> }
>
> Oh, and I forget which commit introduced BulkInsertState->next_free and
> last_free, but I remember thinking that it didn't seem to fit with the
> other parts of that commit.

I'll move it into the one using ExtendBufferedRelBy().

> - /* Without FSM, always fall out of the loop and extend */
> - if (!use_fsm)
> - break;
> + if (bistate
> + && bistate->next_free != InvalidBlockNumber
> + && bistate->next_free <= bistate->last_free)
> + {
> + /*
> + * We bulk extended the relation before, and there are still some
> + * unused pages from that extension, so we don't need to look in
> + * the FSM for a new page. But do record the free space from the
> + * last page, somebody might insert narrower tuples later.
> + */
>
> Why couldn't we have found out that we bulk-extended before and get the
> block from there up above the while loop?

I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.

I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.

> + {
> + bistate->next_free = InvalidBlockNumber;
> + bistate->last_free = InvalidBlockNumber;
> + }
> + else
> + bistate->next_free++;
> + }
> + else if (!use_fsm)
> + {
> + /* Without FSM, always fall out of the loop and extend */
> + break;
> + }
>
> It would be nice to have a comment explaining why this is in its own
> else if instead of breaking earlier (i.e. !use_fsm is still a valid case
> in the if branch above it)

I'm not quite following. Breaking where earlier?

Note that that branch is old code, it's just that a new way of getting a page
that potentially has free space is preceding it.

> we can get rid of needLock and waitcount variables like this
>
> +#define MAX_BUFFERS 64
> + Buffer victim_buffers[MAX_BUFFERS];
> + BlockNumber firstBlock = InvalidBlockNumber;
> + BlockNumber firstBlockFSM = InvalidBlockNumber;
> + BlockNumber curBlock;
> + uint32 extend_by_pages;
> + uint32 no_fsm_pages;
> + uint32 waitcount;
> +
> + extend_by_pages = num_pages;
> +
> + /*
> + * Multiply the number of pages to extend by the number of waiters. Do
> + * this even if we're not using the FSM, as it does relieve
> + * contention. Pages will be found via bistate->next_free.
> + */
> + if (needLock)
> + waitcount = RelationExtensionLockWaiterCount(relation);
> + else
> + waitcount = 0;
> + extend_by_pages += extend_by_pages * waitcount;
>
> if (!RELATION_IS_LOCAL(relation))
> extend_by_pages += extend_by_pages *
> RelationExtensionLockWaiterCount(relation);

I guess I find it useful to be able to quickly add logging messages for stuff
like this. I don't think local variables are as bad as you make them out to be
:)

Thanks for the review!

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-03-29 04:13:55 Re: refactoring relation extension and BufferAlloc(), faster COPY
Previous Message Kyotaro Horiguchi 2023-03-29 03:09:08 Re: Should vacuum process config file reload more often