| From: | BharatDB <bharatdbpg(at)gmail(dot)com> |
|---|---|
| To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | Melanie Plageman <melanieplageman(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Checkpointer write combining |
| Date: | 2025-10-28 12:51:41 |
| Message-ID: | CAAh00EQ_vL+1TFLSSS5YXYH+KCg=P5ND3-simX3L2c9P536Xrw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> Hi Milanie,
>
> Thanks for updating the patch set. I review 1-6 and got a few more small
> comments. I didn’t review 0007 as it’s marked as WIP.
>
> >
> > - Melanie
> >
> <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>
>
> 1 - 0001
> ```
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner,
> Buffer buffer)
> /*
> * Internal buffer management routines
> */
> +
> +
> +/* Note: these two macros only work on shared buffers, not local ones! */
> ```
>
> Nit: here you added two empty lines, I think we need only 1.
>
> 2 - 0002
> ```
> +static void
> +CleanVictimBuffer(BufferDesc *bufdesc,
> + bool from_ring, IOContext io_context)
> +{
>
> - /* Find smgr relation for buffer */
> - if (reln == NULL)
> - reln = smgropen(BufTagGetRelFileLocator(&buf->tag),
> INVALID_PROC_NUMBER);
> + XLogRecPtr max_lsn = InvalidXLogRecPtr;
> + LWLock *content_lock;
> ```
>
> Nit: the empty line after “{“ should be removed.
>
> 3 - 0003
> ```
> +/*
> + * Return the next buffer in the ring or InvalidBuffer if the current
> sweep is
> + * over.
> + */
> +Buffer
> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
> +{
> + if (++(*sweep_cursor) >= strategy->nbuffers)
> + *sweep_cursor = 0;
> +
> + return strategy->buffers[*sweep_cursor];
> +}
> ```
>
> Feels the function comment is a bit confusing, because the function code
> doesn’t really perform sweep, the function is just a getter. InvalidBuffer
> just implies the current sweep is over.
>
> Maybe rephrase to something like: “Return the next buffer in the range. If
> InvalidBuffer is returned, that implies the current sweep is done."
>
> 4 - 0003
> ```
> static BufferDesc *
> NextStratBufToFlush(BufferAccessStrategy strategy,
> Buffer sweep_end,
> XLogRecPtr *lsn, int *sweep_cursor)
> ``
>
> “Strat” is confusing. I think it’s the short version of “Strategy”. As
> this is a static function, and other function names all have the whole word
> of “strategy”, why don’t also use the whole word in this function name as
> well?
>
> 5 - 0004
> ```
> +uint32
> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> +{
> + uint32 max_possible_buffer_limit;
> + uint32 max_write_batch_size;
> + int strategy_pin_limit;
> +
> + max_write_batch_size = io_combine_limit;
> +
> + strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> + max_possible_buffer_limit = GetPinLimit();
> +
> + max_write_batch_size = Min(strategy_pin_limit,
> max_write_batch_size);
> + max_write_batch_size = Min(max_possible_buffer_limit,
> max_write_batch_size);
> + max_write_batch_size = Max(1, max_write_batch_size);
> + max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
> + Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> + return max_write_batch_size;
> +}
> ```
>
> This implementation is hard to understand. I tried to simplify it:
> ```
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
> int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> uint32 max_write_batch_size = Min(GetPinLimit(),
> (uint32)strategy_pin_limit);
>
> /* Clamp to io_combine_limit and enforce minimum of 1 */
> if (max_write_batch_size > io_combine_limit)
> max_write_batch_size = io_combine_limit;
> if (max_write_batch_size == 0)
> max_write_batch_size = 1;
>
> Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> return max_write_batch_size;
> }
> ```
>
> Best regards,
> --
> Chao Li (Evan)
> HighGo Software Co., Ltd.
> https://www.highgo.com/
>
>
>
> Hello All,
> As per reference to the previous mails, I understood the changes made and
> had tried to replicate the patches into the source code for the bug fix but
> it didn't show any significant bug. Also I ran some verification tests for
> the recent changes related to batched write statistics during checkpoints.
> Below are my observations and results:
>
> Test Setup
>
> -
>
> PostgreSQL version: 19devel (custom build)
> -
>
> OS: Ubuntu Linux
> -
>
> Port: 55432
> -
>
> Database: postgres
> -
>
> Test tool: pgbench
> -
>
> Duration: 120 seconds
> -
>
> Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres
>
> Log Output
After running the workload, I triggered a manual checkpoint and checked the
latest log entry:
2025-10-28 16:53:05.696 IST [11422] LOG: checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;
distance=33437 kB, estimate=308790 kB;
Observations:
Metric
Value
Source
Interpretation
Buffers written
1383
From log
Consistent with moderate workload
Checkpoint write time
0.023 s
From log
Realistic for ~11 MB write
Checkpoint sync time
0.017 s
From log
Reasonable
Total checkpoint time
0.071 s
From log
≈ write + sync + small overhead
CHECKPOINT runtime (psql)
-
-
Fast, confirms idle background activity
The total time closely matches the sum of write and sync times, with only a
small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer also updated correctly, with no
missing or duplicated values.
Expected behavior observed:
-
write + sync ≈ total (no zero, truncation, or aggregation bug)
-
Buffer counts and timing scale realistically with workload
-
No evidence of under- or over-counted times
-
Checkpoint stats properly recorded in log and pg_stat_checkpointer
Math check:
0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s
overhead => normal for control file + metadata writes.
Comparison to Prior Reports:
Test
Pre-Patch
Post-Patch
Difference
Checkpoint duration
6.5 s
5.0 s
−23
Heavy workload test
16 s
8 s
−50
Result: It shows consistent and stable timing even under moderate pgbench
load — confirming the patch is integrated and functioning correctly.
Final Status:
Category
Result
Batched Write Stats Accuracy
Verified OK
Timing Aggregation Correctness
Verified OK
pg_stat_checkpointer Consistency
Verified OK
Log Formatting
Normal
Performance Regression
None detected
*Inference:*
-
The reported write, sync, and total timings are consistent and realistic.
-
No anomalies or timing mismatches were seen.
-
Checkpoint performance seems stable and accurate under moderate load.
-
No regression or counter accumulation issues detected.
Also, I have been verifying output using manual queries recording the
observations and find no significant delays. Observations are recorded
below and the screenshots are attached herewith.
Observations:
Metric
Before Activity
After Activity
Notes
Buffers written
27949
27972
Matches wrote 23 buffers from log
Write time
206882
206887
Matches write=0.005 s from log
Sync time
840
877
Matches sync=0.037 s from log
num_done
6
7
Completed successfully
slru_written
8
9
Matches wrote 1 SLRU buffer from log
Stats reset
yes
yes
successful
num_requested
7
8
manual checkpoint recorded
For further review, I have attached the screenshots of my terminal
herewith.
Kindly review the observations and please let me know if any additional
details need to be focused on.
Thanking you.
Regards,
Soumya
| Attachment | Content-Type | Size |
|---|---|---|
| Screenshot from 2025-10-28 16-54-20.png | image/png | 96.4 KB |
| Screenshot from 2025-10-27 17-17-27.png | image/png | 86.5 KB |
|
image/png | 69.2 KB |
|
image/png | 66.8 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Alexander Kukushkin | 2025-10-28 13:03:00 | Re: Issue with logical replication slot during switchover |
| Previous Message | vignesh C | 2025-10-28 12:51:04 | Re: Logical Replication of sequences |