| From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
|---|---|
| To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Checkpointer write combining |
| Date: | 2026-06-19 22:13:17 |
| Message-ID: | CAAKRu_ZoKTpnmQ3MmmQXTpG7msHrP3jNWC68sf8R82-8cgb7Kw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Attached is v15. I've done substantial work since my last post.
The biggest difference is that regular backends and background writer
write combining try to combine up to io_combine_limit sized write IOs
by looking for preceding and following blocks to their target block
that are in shared buffers and are dirty. I've found this to
substantially improve performance.
For a regular pgbench scale 500 with 8 clients and 4GB SB on a machine
with IOPS limited to 7500, pgbench TPS improves by 30% (I tuned
backend_flush_after to 512kB). I think in an environment where buffers
and IOPS are scarce, those extra buffers getting cleaned each time is
very helpful. It also improves the p99 latency by 75%. If you turn off
synchronous commit, it is actually almost a 70% increase in TPS
(because over 50% of the IO is WAL with sync commit on). In the stats
I collect, you see that with my patch backends and bgwriter are doing
larger writes and this is the source of the improvement.
I did several more realistic (than tpc-b) benchmarks -- including
multiple clients inserting 1000 rows at a time and saw similar
speedups. I isolated the different combining types and it seems that
the biggest benefit is coming from backends doing their own combining.
If you tune everything so the bgwriter is doing most of the cleaning
of buffers, there is still a benefit, but it is hard to get the
bgwriter sufficiently aggressive. checkpointer has little effect here
because the kernel was already combining those writes.
I did see some benefit in workloads with a fast, local SSD, but the
biggest benefit was in an IOPs constrained environment (which makes
sense). I'll note that I've lost the ability to reproduce the
performance gain with parallel COPY FROMs.
v15 also adds TAP tests of the write combining behavior.
I haven't enabled and started testing write combining for vacuum and bulkread.
The patchset still needs review and some polish. There are a number of
outstanding questions mostly around different function APIs and a bit
around some of the control flow.
I substantially refactored GetVictimBuffer() in
v15-0004-Simplify-victim-buffer-selection and
v15-0005-Refactor-victim-buffer-selection-and-add-helpers. I have
questions about some of the control flow changes I made including
whether or not we should keep looking in the strategy ring once we
find a buffer that has a usagecount > 1 (the current behavior in
master that I've preserved in my patch).
Another question is if EagerCleanStrategyBuffer() should change how it
advances the cursor to avoid calling PrepareOrRejectEagerFlushBuffer()
twice for a buffer that was rejected for reasons other than being
non-contiguous. I think that requirement worsens
PrepareOrRejectEagerFlushBuffer()'s API and potentially eliminate some
buffers we want to revisit, however without it, we may end up having
to reject the same buffers twice.
I've also incorporated Chao's review feedback. Here are my inline replies:
On Wed, Feb 25, 2026 at 1:53 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
> A few comments for v14:
Thanks for continuing to review!
> 1 - 0001
> ```
> - FlushBuffer(buf, reln, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> + FlushBuffer(buf, reln, io_object, io_context);
> ```
>
> This changes the hardcode IOOBJECT_RELATION to io_object when calling FlushBuffer(). But FlushBuffer() itself still uses IOOBJECT_RELATION instead of io_object, so the change will actually not take effective:
> ```
> pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
> IOOP_WRITE, io_start, 1, BLCKSZ);
> ```
>
> So, an update in FlushBuffer() is also needed.
I believe we did already fix this. However, because FlushLocalBuffer()
exists, FlushBuffer() doesn't need to take an IOObject. And definitely
FlushUnlockedBuffer() does not need to -- since it takes an lock and
thus wouldn't be used for local buffers. I wonder if it is worth the
code churn to change these to avoid any confusion.
FlushRelationBuffers(), FlushRelationAllBuffers(),
FlushDatabaseBuffers(), and FlushOneBuffer(),
EvictUnpinnedBufferInternal() could stand to take an IOContext, though
they don't need it for their current callers, so I won't add the extra
code churn.
> 2 - 0003 - freelist.c
> ```
> bool
> StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
> {
> + Assert(strategy);
> +
> /* We only do this in bulkread mode */
> if (strategy->btype != BAS_BULKREAD)
> return false;
> @@ -795,8 +800,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
> strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
> return false;
>
> + Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(buf)));
> + Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));
> ```
>
> I don’t quite understand Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));
>
> The comment says “the buffer header spinlock must not be held,” but I doubt this Assert actually ensures that. Since BM_LOCKED is just a bit in a shared state, isn't it possible for a concurrent backend to grab the lock right when we’re checking it?
>
> If a concurrent process locks the header a split-second before the Assert, we’ll get a false-positive crash in a dev environment even though the code is fine. If they lock it a split-second after, then the Assert didn't really catch anything. It feels like a race condition either way.
>
> I think the only way to truly ensure a spinlock is "not held" is to actually acquire it. If we’re just checking the bit like this, it’s just a snapshot that might be stale by the time the instruction finishes. What was the specific worry here—a self-deadlock, or something else?
I ended up changing StrategyRejectBuffer() substantially since then
and it no longer has this assert. However, this assert is more of a
sanity check. It is present for example in UnlockBufHdr().
> +typedef enum BufferUsageCountChange
> +{
> + BUC_ZERO,
> + BUC_MAX_ONE,
> + BUC_ONE,
> +} BufferUsageCountChange;
> ```
>
> Can we add some brief comments to explain every enum item? I feel hard to guess the meaning from the names without further reading the code.
Done
> +static BufferDesc *
> +PrepareOrRejectEagerFlushBuffer(Buffer bufnum)
> ```
>
> This header comment looks stale.
>
> * It says “and with BM_IO_IN_PROGRESS set”, but I don’t see where BM_IO_IN_PROGRESS is set in this function.
> * It says “also return its LSN”, but I don’t see LSN is returned.
Fixed
> + /* Start IO on the first buffer */
> + if (!StartBufferIO(buf_hdr, false, false))
> + goto clean;
> ```
>
> This failure branch feels like leaking the content lock. The buffer was already share-locked via BufferLockConditional() earlier, and I don’t see the “clean" path unlock that content lock before returning.
I think I've found and fixed anything like this.
> + FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> + UnpinBuffer(bufHdr);
> +
> + ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &bufHdr->tag);
> ```
>
> bufHdr is unlocked and unpinned, it might be unsafe to still use bufHdr->tag. I think we can copy bufHdr->tag to a local variable before unlock the buffer.
Fixed.
On Wed, Feb 25, 2026 at 11:24 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
> - * batch_limit is the largest batch we are allowed to construct given the
> - * remaining blocks in the table, the number of available pins, and the
> - * current configuration.
> + * max_batch_size is the maximum number of blocks that can be combined into a
> + * single write in general. This function, based on the block number of start,
> + * will determine the maximum IO size for this particular write given how much
> + * of the file remains. max_batch_size is provided by the caller so it doesn't
> + * have to be recalculated for each write.
> ```
>
> I don’t see the function FindStrategyFlushAdjacents() has a parameter named max_batch_size, but batch_limit is still there.
Fixed.
> static void
> +FindFlushAdjacents(BufferDesc *batch_start,
> + uint32 batch_limit,
> + BufferWriteBatch *batch)
> +{
> + BufferTag require; /* requested block's buffer tag */
> + uint32 newHash; /* hash value for require */
> + LWLock *newPartitionLock; /* buffer partition lock for it */
> + int buf_id;
> +
> + /* create a tag so we can lookup the buffers */
> + InitBufferTag(&require, &batch->reln->smgr_rlocator.locator,
> + batch->forkno, InvalidBlockNumber);
> +
> + for (; batch->n < batch_limit; batch->n++)
> + {
> + XLogRecPtr lsn;
> +
> + require.blockNum = batch->start + batch->n;
> +
> + Assert(BlockNumberIsValid(require.blockNum));
> ```
>
> When I first time read the Assert, I was confused how it can assume the block to be valid, then I realized that WriteBatchInit() ensures a safe “limit”. Maybe add a brief comment for the Assert.
Added comment.
> + batch->bufdescs[batch->n] =
> + PrepareOrRejectEagerFlushBuffer(buf_id + 1,
> + &require,
> + newPartitionLock,
> + &lsn);
> + if (lsn > batch->max_lsn)
> + batch->max_lsn = lsn;
> +
> + if (batch->bufdescs[batch->n] == NULL)
> + break;
> ```
>
> I think we can move if (batch->bufdescs[batch->n] == NULL) to before if (lsn > batch->max_lsn).
>
> This is a correctness issue, because PrepareOrRejectEagerFlushBuffer will set lsn to InvalidXLogRecPtr when returns NULL, but doing the NULL check early just feels more reasonable.
Fixed.
> + * max_lsn may be updated if the provided buffer LSN exceeds the current max
> + * LSN.
> */
> static BufferDesc *
> PrepareOrRejectEagerFlushBuffer(Buffer bufnum,
> BufferTag *require,
> + LWLock *buftable_lock,
> XLogRecPtr *lsn)
> ```
>
> The function doesn’t have a parameter named max_lsn, is it “lsn”?
Fixed.
> +FindFlushAdjacents(BufferDesc *batch_start,
> + uint32 batch_limit,
> + BufferWriteBatch *batch)
> ```
>
> Looks like batch_start is not used at all in this function.
>
> 12 - 0011
> ```
> + * Callers specify if and by how much they want to bump the buffer's usage
> + * count.
> ```
>
> I don’t get what this comment means.
I removed it.
On Fri, Jan 23, 2026 at 7:17 AM Soumya S Murali
<soumyamurali(dot)work(at)gmail(dot)com> wrote:
>
> Batch size = 8
> LOG: checkpoint complete: wrote 12622 buffers (77.0%); write=0.113 s,
> sync=0.195 s, total=0.485 s; sync files=37
> DEBUG: checkpoint BufferSync stats: buffers_written=9923, writeback_calls=1242
> Avg: 7.989 approx 8 buffers per writeback.
>
> Batch size = 16
> LOG: checkpoint complete: wrote 13537 buffers (82.6%); write=0.260 s,
> sync=0.211 s, total=0.625 s; sync files=3
> DEBUG: checkpoint BufferSync stats: buffers_written=6196, writeback_calls=389
> Avg: 15.9 approx 16 buffers per writeback.
>
> Batch size = 32
> LOG: checkpoint complete: wrote 12914 buffers (78.8%); write=0.116 s,
> sync=0.136 s, total=0.442 s; sync files=5
> DEBUG: checkpoint BufferSync stats: buffers_written=12914, writeback_calls=1616
> Avg: 7.99 approx 8 buffers per writeback.
>
> Batch 16 significantly reduces sync fan-out (as low as 3 files per
> checkpoint), but this comes at the cost of longer individual sync
> operations, resulting in higher total checkpoint time (≈0.625 s).
> Batch 32 provides a better balance, maintaining low sync fragmentation
> while avoiding long sync stalls, yielding the lowest overall
> checkpoint time (≈0.442 s). I am attaching the patch with batch size
> fixed as 32 for now for further review.
> Please let me know if further workloads or instrumentation would be useful.
FWIW, I found batch size 16 to be optimal in experimenting with my own
patchset as well.
- Melanie
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Masahiko Sawada | 2026-06-19 22:33:21 | Add a hook for handling logical decoding messages on subscribers. |
| Previous Message | Zsolt Parragi | 2026-06-19 20:42:56 | Re: More jsonpath methods: translate, split, join |