From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Checkpointer write combining
Date: 2025-09-11 23:11:12
Message-ID: CAAKRu_atZuU3gYY6hVVuvqDW+mfXM+fCWCyYODbZPesxyr=y6g@mail.gmail.com
Lists: pgsql-hackers
On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
Thanks for the review!
For feedback that I simply implemented, I haven't replied inline;
those changes are included in the attached v6. My inline replies below
are only for feedback that needs more discussion.
> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
>
> 2 - 0001
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> + if (XLogNeedsFlush(lsn))
> + {
> + /*
> + * Remove the dirty buffer from the ring; necessary to prevent an
> + * infinite loop if all ring members are dirty.
> + */
> + strategy->buffers[strategy->current] = InvalidBuffer;
> + return true;
> + }
>
> - return true;
> + return false;
> }
> ```
>
> We can do:
> ```
> if (!XLogNeedsFlush(lsn))
>     return false;
>
> /* Remove the dirty buffer ….
>  */
> return true;
> }
> ```
This would make the order of evaluation the same as master, but I
actually prefer it this way: we only take the buffer header spinlock
if there is a chance we will reject the buffer (e.g. for strategies
other than BAS_BULKREAD we don't need to examine it at all).
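Concretely, the ordering I'm keeping looks roughly like this (a
simplified sketch of the rejection check, not the exact v6 code;
LockBufHdr()/UnlockBufHdr() are the existing buffer header helpers):
```
    uint32      buf_state;
    XLogRecPtr  lsn;

    /*
     * Only BAS_BULKREAD rings can reject buffers, so for every other
     * strategy we return before touching the buffer header at all.
     */
    if (strategy->btype != BAS_BULKREAD)
        return false;

    /* Only now pay for the header spinlock to read the page LSN. */
    buf_state = LockBufHdr(buf);
    lsn = BufferGetLSN(buf);
    UnlockBufHdr(buf, buf_state);

    if (XLogNeedsFlush(lsn))
    {
        /*
         * Remove the dirty buffer from the ring; necessary to prevent
         * an infinite loop if all ring members are dirty.
         */
        strategy->buffers[strategy->current] = InvalidBuffer;
        return true;
    }

    return false;
```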
> 4 - 0002
> ```
> - /* OK, do the I/O */
> - FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
> - LWLockRelease(content_lock);
> -
> - ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
> - &buf_hdr->tag);
> + CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
> ```
> I saw that CleanVictimBuffer() takes content_lock from the bufdesc and releases it, but that makes the code hard to understand. Readers might be confused about why content_lock is not released after CleanVictimBuffer() unless they go read CleanVictimBuffer() itself.
>
> I'd suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.
I think for this specific patch in the set, your idea makes sense.
However, in the later patch that does write combining, I release the
content locks for the whole batch in CompleteWriteBatchIO(), and
passing the start buffer's lock separately as a parameter would force
me to special-case it.
I've added a comment to both CleanVictimBuffer() and its caller
specifying that the lock must be held and that it will be released
inside CleanVictimBuffer.
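For reference, the new comments read along these lines (paraphrased,
not verbatim):
```
/*
 * CleanVictimBuffer
 *
 * The caller must hold the buffer's content lock on entry. The lock is
 * released inside this function (or, once write combining is added, in
 * CompleteWriteBatchIO() for all members of the batch), so the caller
 * must not release it again.
 */
```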
> 5 - 0002
> ```
> * disastrous system-wide consequences. To make sure that can't happen,
> * skip the flush if the buffer isn't permanent.
> */
> - if (buf_state & BM_PERMANENT)
> - XLogFlush(recptr);
> + if (!XLogRecPtrIsInvalid(buffer_lsn))
> + XLogFlush(buffer_lsn);
> ```
>
> Why is this check changed? Should the comment be updated accordingly? It says "if the buffer isn't permanent", which reflects the old code.
It's changed because I split determining the LSN and flushing to it
across the Prepare and Do functions. This is needed because, when we
do batches, we want to flush to the max LSN across all buffers in the
batch.
I check if the buffer is BM_PERMANENT in PrepareFlushBuffer(). You
make a good point about my comment, though. I've moved it to
PrepareFlushBuffer() and updated it.
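So the Do side ends up looking roughly like this (a sketch;
buffer_lsns is a hypothetical array standing in for whatever
PrepareFlushBuffer() records, with InvalidXLogRecPtr for
non-permanent buffers):
```
    XLogRecPtr  max_lsn = InvalidXLogRecPtr;

    for (int i = 0; i < batch->n; i++)
    {
        /* InvalidXLogRecPtr for buffers that aren't BM_PERMANENT */
        if (max_lsn < buffer_lsns[i])
            max_lsn = buffer_lsns[i];
    }

    /* A single WAL flush covers every buffer in the batch. */
    if (!XLogRecPtrIsInvalid(max_lsn))
        XLogFlush(max_lsn);
```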
> 8 - 0003
> ```
> bool
> +strategy_supports_eager_flush(BufferAccessStrategy strategy)
> ```
>
> This function is only used in bufmgr.c; can we move it there and make it static?
BufferAccessStrategyData is opaque to bufmgr.c. Only freelist.c can
access it. I agree it is gross that I have these helpers and functions
that would otherwise be static, though.
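To spell out the layering (paraphrasing the existing headers, field
list trimmed):
```
/* include/storage/buf.h: only an opaque pointer is exported */
typedef struct BufferAccessStrategyData *BufferAccessStrategy;

/* storage/buffer/freelist.c: the struct body is private to this file */
typedef struct BufferAccessStrategyData
{
    BufferAccessStrategyType btype; /* kind of ring */
    int         nbuffers;           /* ring size */
    int         current;            /* current ring slot */
    Buffer      buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
```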
> 10 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + limit = Min(max_batch_size, limit);
> ```
>
> Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT - 1)? BufWriteBatch.bufdescs is defined with length MAX_IO_COMBINE_LIMIT, and the first slot is already used to store "start".
I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?
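That is, something like this inside StrategyMaxWriteBatchSize()
(sketch):
```
    /* the GUC caps io_combine_limit at MAX_IO_COMBINE_LIMIT */
    Assert(io_combine_limit <= MAX_IO_COMBINE_LIMIT);

    /* so a batch capped at io_combine_limit always fits in bufdescs[] */
    max_batch_size = Min(max_batch_size, io_combine_limit);
```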
> 11 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + for (batch->n = 1; batch->n < limit; batch->n++)
> + {
> + Buffer bufnum;
> +
> + if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
> + break;
> ```
>
> Is the sweep's next buffer right next to start? If yes, can we assert that? My guess is no; and if so, is it possible that bufnum comes back around to start? If that's true, then we should check that the next buffer doesn't equal start.
Ah, great point. I didn't think about this. Our sweep will always
start right after the start buffer, but then if it goes all the way
around, it will "lap" the start buffer. Because of this and because I
think it is weird to have the sweep variables in the
BufferAccessStrategy object, I've changed my approach in attached v6.
I set sweep_end to be the start block in the batch and then pass
around a sweep cursor variable. Hitting sweep_end is the termination
condition.
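The v6 shape is roughly this (simplified; identifier names are
approximate):
```
    int     sweep_end = start_slot; /* ring slot of the start buffer */
    int     cursor = start_slot;

    for (batch->n = 1; batch->n < limit; batch->n++)
    {
        Buffer  bufnum;

        cursor = (cursor + 1) % strategy->nbuffers;
        if (cursor == sweep_end)
            break;              /* wrapped around: never lap the start */

        bufnum = strategy->buffers[cursor];
        if (bufnum == InvalidBuffer)
            break;

        /* ... otherwise consider bufnum for inclusion in the batch ... */
    }
```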
> 12 - 0004
> ```
> @@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
>
> if (from_ring && strategy_supports_eager_flush(strategy))
> {
> + uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
> ```
>
> I think max_batch_size can be an attribute of the strategy, set when the strategy is created, so that we don't need to calculate it on every cleaning round.
Actually, the max pin limit can change quite frequently. See
GetAdditionalPinLimit()'s usage in read stream code. If the query is
pinning other buffers in another part of the query, it can change our
limit.
I'm not sure if I should call GetAdditionalPinLimit() for each batch
or for each run of batches (like in StrategyMaxWriteBatchSize()).
Currently, I call it for each batch (in FindFlushAdjacents()). The
read stream calls it pretty frequently (in each
read_stream_start_pending_read()). But in the batch flush case,
nothing can change between batches in a run of batches. So maybe I
should move it up and out and make it per run of batches...
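Moving it per run would look something like this (sketch; the loop
helpers here are hypothetical, not from the patch):
```
    uint32  max_batch_size;

    /* computed once before the run instead of once per batch */
    max_batch_size = Min(io_combine_limit, GetAdditionalPinLimit());

    while (have_flush_candidates(strategy))         /* hypothetical */
        flush_next_batch(strategy, max_batch_size); /* hypothetical */
```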
> 13 - 0004
> ```
> +void
> +CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
> + WritebackContext *wb_context)
> +{
> + ErrorContextCallback errcallback =
> + {
> + .callback = shared_buffer_write_error_callback,
> + .previous = error_context_stack,
> + };
> +
> + error_context_stack = &errcallback;
> + pgBufferUsage.shared_blks_written += batch->n;
> ```
>
> Should we increase shared_blks_written only after the writeback loop is done?
On master, FlushBuffer() does it after smgrwrite() (before writeback).
I think pgBufferUsage is mainly used in EXPLAIN (also
pg_stat_statements), which isn't consulted until the end of the query
and isn't displayed if we error out.
> 14 - 0004
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> +uint32
> +max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
> ```
>
> I think this function can be moved to bufmgr.c and made static.
This technically could be moved, but it reports information about a
strategy, which seemed to fit better in freelist.c.
> 18 - 0007
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + max_batch_size = checkpointer_max_batch_size();
> ```
>
> Looks like we don't need to calculate max_batch_size in the for loop.
I don't think it's in the for loop.
- Melanie
Attachment | Content-Type | Size
---|---|---
v6-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch | text/x-patch | 12.8 KB
v6-0002-Split-FlushBuffer-into-two-parts.patch | text/x-patch | 8.1 KB
v6-0003-Eagerly-flush-bulkwrite-strategy-ring.patch | text/x-patch | 11.1 KB
v6-0004-Write-combining-for-BAS_BULKWRITE.patch | text/x-patch | 16.7 KB
v6-0005-Fix-XLogNeedsFlush-for-checkpointer.patch | text/x-patch | 2.9 KB
v6-0006-Add-database-Oid-to-CkptSortItem.patch | text/x-patch | 1.9 KB
v6-0007-Implement-checkpointer-data-write-combining.patch | text/x-patch | 11.2 KB