Re: Lowering the default wal_blocksize to 4K

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andy Pogrebnoi <andrew(dot)pogrebnoi(at)percona(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Lowering the default wal_blocksize to 4K
Date: 2026-02-16 21:13:37
Message-ID: x56pxq6jpuftn6ear3uwbz5tyb4r37itv5zetvcucq3td57apc@yq7jyka7tszv
Lists: pgsql-hackers

Hi,

On 2026-02-16 10:04:37 +0200, Andy Pogrebnoi wrote:
> > On Oct 10, 2023, at 02:08, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > I've mentioned this to a few people before, but forgot to start an actual
> > thread. So here we go:
> >
> > I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
> > the current 8192.
>
> I prepared a patch in case we want to move forward with a default 4kB XLOG_BLCKSZ.

I think we should.

> Regarding reducing the page headers' size, the benefits of 4Kb wal_blocks
> outweigh the disadvantages of the proportionally bigger header in my opinion.

I agree.

> Since we recycle WAL segments, the added size won't add to disk usage but
> will rather cause slightly more frequent segment switches.

I don't think that's a valid argument though: how much WAL needs to be
archived is a relevant factor.

> > One thing I noticed is that our auto-configuration of wal_buffers leads to
> > different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
> > great.
>
> I don't think it's an issue as wal_buffers are in block units, not bytes. Even
> though the auto-tuned number may change, the total amount of bytes still remains
> the same with different XLOG_BLCKSZ.

Given the way the auto-tuning works, I don't think that's true:

/*
 * Auto-tune the number of XLOG buffers.
 *
 * The preferred setting for wal_buffers is about 3% of shared_buffers, with
 * a maximum of one XLOG segment (there is little reason to think that more
 * is helpful, at least so long as we force an fsync when switching log files)
 * and a minimum of 8 blocks (which was the default value prior to PostgreSQL
 * 9.1, when auto-tuning was added).
 *
 * This should not be called until NBuffers has received its final value.
 */
static int
XLOGChooseNumBuffers(void)
{
    int         xbuffers;

    xbuffers = NBuffers / 32;
    if (xbuffers > (wal_segment_size / XLOG_BLCKSZ))
        xbuffers = (wal_segment_size / XLOG_BLCKSZ);
    if (xbuffers < 8)
        xbuffers = 8;
    return xbuffers;
}

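If NBuffers / 32 < wal_segment_size / XLOG_BLCKSZ, the chosen xbuffers value
does not depend on XLOG_BLCKSZ. As wal_buffers is in units of XLOG_BLCKSZ, the
same xbuffers value then covers half as many bytes with 4kB blocks as with 8kB.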

To me the code only makes sense if you assume that NBuffers / 32 gives you a
value in the same domain as data blocks, otherwise NBuffers / 32 is not the
approximation of 3% that the comment talks about.

I think the code just needs to be fixed to multiply NBuffers * BLCKSZ and then
divide that by XLOG_BLCKSZ.
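
A minimal sketch of that byte-based calculation, as an illustration rather than
the actual patch. The function name choose_num_xlog_buffers() and the explicit
parameters are hypothetical, chosen so the snippet can be tried standalone; in
the tree these values would come from NBuffers, BLCKSZ, XLOG_BLCKSZ and
wal_segment_size. The 64 bit intermediate is an assumption to avoid overflowing
an int when multiplying by the block size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: pick the number of WAL buffers from a byte target instead of a
 * block count, so the result scales with the WAL block size.
 */
int
choose_num_xlog_buffers(int nbuffers, int blcksz, int xlog_blcksz,
                        int wal_segment_size)
{
    int64_t     target_bytes;
    int64_t     xbuffers;

    assert(blcksz > 0 && xlog_blcksz > 0);

    /* 1/32 (about 3%) of shared_buffers, computed in bytes */
    target_bytes = (int64_t) nbuffers * blcksz / 32;

    /* convert the byte target into a number of WAL blocks */
    xbuffers = target_bytes / xlog_blcksz;

    /* same limits as the existing code: at most one segment, at least 8 blocks */
    if (xbuffers > wal_segment_size / xlog_blcksz)
        xbuffers = wal_segment_size / xlog_blcksz;
    if (xbuffers < 8)
        xbuffers = 8;

    return (int) xbuffers;
}
```

With 128MB of shared_buffers (16384 buffers of 8192 bytes) and 16MB segments,
this yields 512 buffers for an 8192 byte XLOG_BLCKSZ and 1024 buffers for 4096,
i.e. 4MB in both cases. Note that the minimum of 8 blocks is still expressed in
blocks in this sketch, so very small configurations would still differ in bytes.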

>
> > For some example numbers, I ran a very simple insert workload with a varying
> > number of clients with both a wal_blocksize=4096 and wal_blocksize=8192
> > cluster, and measured the amount of bytes written before/after.
>
> I've also run some simple tests on my local machine (Ubuntu in Vagrant on an M1
> Mac). I ran a sysbench write-only load for 20s with varying numbers of threads
> (with the number of tables equal to the number of threads) and measured disk
> writes with iostat. I recreated the tables and did a checkpoint before each run.
> These are my results:
>
> 8Kb XLOG_BLCKSZ
> ====
> Threads  tps      kB_wrtn
> 1        535.34   207288
> 5        1457.24  591708
> 10       1441.85  574700
> 15       823.98   388732
>
> 4Kb XLOG_BLCKSZ
> ====
> Threads  tps      kB_wrtn
> 1        542.02   153544
> 5        1556.83  393444
> 10       1288.00  339648
> 15       975.32   255708

The reduction in bytes written is rather impressive...
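
For reference, the relative reduction in kB_wrtn can be worked out from the
quoted figures with a small helper like the one below. The names
percent_reduction() and report_reductions() are made up for illustration, and
the arrays simply repeat the numbers from the tables above. The result is not
normalized by tps, which differed between the runs, so it is only a rough
comparison.

```c
#include <stdio.h>

/* Relative reduction from "before" to "after", in percent. */
double
percent_reduction(long before_kb, long after_kb)
{
    return 100.0 * (double) (before_kb - after_kb) / (double) before_kb;
}

/* Print the reduction for each thread count from the quoted sysbench runs. */
void
report_reductions(FILE *out)
{
    static const int  threads[] = {1, 5, 10, 15};
    static const long kb_8k[] = {207288, 591708, 574700, 388732};
    static const long kb_4k[] = {153544, 393444, 339648, 255708};
    int         i;

    for (i = 0; i < 4; i++)
        fprintf(out, "%2d threads: %.1f%% fewer kB written\n",
                threads[i], percent_reduction(kb_8k[i], kb_4k[i]));
}
```

Calling report_reductions(stdout) prints one line per thread count. For these
runs the reduction is roughly 26% with 1 thread and roughly 41% with 10 threads.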

> I will run more benchmarks on proper hardware. For example, it would be
> interesting to see what happens to performance with >4K writes. But what else
> do you think has to be done to move this patch forward?

I think the auto-tuning bit above needs to be fixed, and it's probably worth
manually testing a pg_upgrade from 8kB XLOG_BLCKSZ to 4kB. It should work, but
...

I think we otherwise should just go for it.

Greetings,

Andres Freund
