| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
| Cc: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Checkpointer write combining |
| Date: | 2026-01-14 07:49:19 |
| Message-ID: | 0D990493-9DF4-4136-B1A1-CA0112E809ED@gmail.com |
| Lists: | pgsql-hackers |
> On Jan 14, 2026, at 06:24, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
>
> Attached v12 fixes a variety of buglets I found throughout the patch set.
>
> I've also done quite a bit of refactoring. The scope of the
> refactoring varies from inlining some helpers to introducing new input
> argument structs.
>
> 0001 is independently valuable as it optimizes StrategyRejectBuffer()
> a bit and makes GetVictimBuffer() cleaner
>
> 0002-0007 were largely present in older versions of the patch set
>
> 0008 is new -- it is an early version of batching for normal backends
> flushing a buffer to obtain a clean one. Right now, it checks if the
> two blocks succeeding the target block are in shared buffers and
> dirty, and, if so, writes them out with the target buffer. I haven't
> started testing or benchmarking it because I need to convert bgwriter
> to use write combining to be able to benchmark it effectively. But I
> thought I would get the code out there sooner rather than later.
>
> It's a lot harder with my current code structure to add the target
> block's predecessor if it is dirty and read to write out. I wonder how
> important this is vs just the two succeeding blocks.
>
> - Melanie
> <v12-0001-Streamline-buffer-rejection-for-bulkreads-of-unl.patch><v12-0002-Split-FlushBuffer-into-two-parts.patch><v12-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v12-0004-Write-combining-for-BAS_BULKWRITE.patch><v12-0005-Add-database-Oid-to-CkptSortItem.patch><v12-0006-Implement-checkpointer-data-write-combining.patch><v12-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch><v12-0008-Eagerly-flush-buffer-successors.patch>
Hi Melanie,
I went through the patch set again today, paying special attention to 0001 and 0008, which I don't seem to have reviewed before. Here are the comments I have:
1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "access/xlogdefs.h"
```
I tried to build without adding this include, and the build still passed. I think that's because there is already an include path: storage/bufmgr.h -> storage/bufpage.h -> access/xlogdefs.h.
So maybe we can remove this include.
2 - 0001
```
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held.
+ *
* Returns true if buffer manager should ask for a new victim, and false
* if this buffer should be written and re-used.
*/
bool
StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
+ XLogRecPtr lsn;
+
+ if (!strategy)
+ return false;
```
As the new comment states, the buffer must be pinned; maybe we can add an assert to ensure that (StrategyRejectBuffer() only receives the BufferDesc, so checking the shared refcount seems the most direct way here):
```
Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf->state)) > 0);
```
Similarly, maybe we can also assert that the content lock is held and the header spinlock is not (note the parentheses around the mask test, since "!" binds tighter than "&"):
```
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
Assert(!(pg_atomic_read_u32(&buf->state) & BM_LOCKED));
```
3 - 0001 - bufmgr.c - BufferNeedsWALFlush()
```
+ buffer = BufferDescriptorGetBuffer(bufdesc);
+ page = BufferGetPage(buffer);
+
+ Assert(BufferIsValid(buffer));
```
I think the Assert should be moved to before "page = BufferGetPage(buffer);".
4 - 0001 - bufmgr.c
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * *lsn is set to the current page LSN.
+ *
+ * If the result is required to be correct, the caller must hold a buffer
+ * content lock. If they only hold a shared content lock, we'll need to
+ * acquire the buffer header spinlock, so they must not already hold it.
+ *
+ * If the buffer is unlogged, *lsn shouldn't be used by the caller and is set
+ * to InvalidXLogRecPtr.
+ */
+bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive, XLogRecPtr *lsn)
```
I think the “exclusive” parameter is a bit subtle. It is not explicitly explained in the header comment, though one paragraph does describe the different behavior depending on whether the caller holds an exclusive or only a shared content lock. Maybe we can rename it to a more direct name: hold_exclusive_content_lock, or a shorter one, hold_content_lock.
5 - 0002 - commit message
```
actual buffer flushing step. This separation procides symmetry with
```
Typo: procides -> provides
6 - 0002
```
+ * We must hold the buffer header lock when examining the page LSN since
+ * don't have buffer exclusively locked in all cases.
```
Looks like it’s better to add “we” between “since” and “don’t” => since we don’t have ...
7 - 0002
```
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
uint32 buf_state;
/*
@@ -4425,42 +4445,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* someone else flushed the buffer before we could, so we need not do
* anything.
*/
- if (!StartBufferIO(buf, false, false))
- return;
-
- /* Setup error traceback support for ereport() */
- errcallback.callback = shared_buffer_write_error_callback;
- errcallback.arg = buf;
- errcallback.previous = error_context_stack;
- error_context_stack = &errcallback;
-
- /* Find smgr relation for buffer */
- if (reln == NULL)
- reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
-
- TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
- buf->tag.blockNum,
- reln->smgr_rlocator.locator.spcOid,
- reln->smgr_rlocator.locator.dbOid,
- reln->smgr_rlocator.locator.relNumber);
-
- buf_state = LockBufHdr(buf);
-
- /*
- * Run PageGetLSN while holding header lock, since we don't have the
- * buffer locked exclusively in all cases.
- */
- recptr = BufferGetLSN(buf);
+ if (!StartBufferIO(bufdesc, false, false))
+ return false;
- /* To check if block content changes while flushing. - vadim 01/17/97 */
- UnlockBufHdrExt(buf, buf_state,
- 0, BM_JUST_DIRTIED,
- 0);
+ *lsn = InvalidXLogRecPtr;
+ buf_state = LockBufHdr(bufdesc);
```
The header comment says “lsn returns the buffer's LSN if the table is logged”, which looks inaccurate, because if StartBufferIO() returns false, the function returns early without setting *lsn.
8 - 0008 - commit message
```
When flushing a dirty buffer, check if it the two blocks following it
```
Nit: “if it the” -> “if the”
9 - 0008
```
+ * Check if the blocks after my block are in shared buffers and dirty and if
+ * it is, write them out too
```
Nit: “if it is” -> “if they are”
10 - 0008
```
+ BlockNumber max_batch_size = 3; /* we only look for two successors */
```
Using the type BlockNumber for a batch size seems odd, since a batch size is a count rather than a block number. I would suggest changing it to uint32 (BlockNumber is a typedef of uint32 anyway, so this is purely about readability).
11 - 0008 - WriteBatchInit()
```
+ LockBufHdr(batch_start);
+ batch->max_lsn = BufferGetLSN(batch_start);
+ UnlockBufHdr(batch_start);
```
Should we check for an unlogged buffer before assigning max_lsn? In the previous commits we have done that in many places.
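Something like the pattern from the earlier commits, as a sketch only (variable names loosely taken from the quoted hunk; I have not compiled this against the patch):
```
buf_state = LockBufHdr(batch_start);
if (buf_state & BM_PERMANENT)
    batch->max_lsn = BufferGetLSN(batch_start);
else
    batch->max_lsn = InvalidXLogRecPtr;
UnlockBufHdr(batch_start);
```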
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/