Lowering the default wal_blocksize to 4K

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Lowering the default wal_blocksize to 4K
Date: 2023-10-09 23:08:05
Message-ID: 20231009230805.funj5ipoggjyzjz6@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

I've mentioned this to a few people before, but forgot to start an actual
thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096 from
the current 8192. The reasons are:

a) We don't gain much from a blocksize above 4096, as we already write all
the pending WAL data in one go (except when the write wraps around at the
tail of wal_buffers). We *do* incur more overhead for page headers, but
compared to the actual WAL data it is not a lot (~0.29% of space is page
headers with 8192 vs ~0.59% with 4096; see the arithmetic after this list).

b) Writing 8KB when we have to flush a partially filled buffer can
substantially increase write amplification. In a transactional workload,
this will often double the write volume.
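
For concreteness, here's the arithmetic behind the percentages in (a), as a
minimal sketch (the 24 bytes are SizeOfXLogShortPHD, the short WAL page
header, on typical 64-bit platforms):

#include <stdio.h>

/* Header overhead per WAL page, assuming the common 24-byte short page
 * header (SizeOfXLogShortPHD on 64-bit platforms). */
int
main(void)
{
	const double phd = 24.0;

	printf("8192: %.2f%%\n", 100.0 * phd / 8192.0);	/* ~0.29% */
	printf("4096: %.2f%%\n", 100.0 * phd / 4096.0);	/* ~0.59% */
	return 0;
}

The same arithmetic illustrates (b): a commit record of a few hundred bytes
that has to be flushed on its own still costs a full page write, i.e. 8192
bytes instead of 4096.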

Currently disks mostly have a 4096 byte "sector size". Sometimes that's
exposed directly; sometimes they can also write in 512 byte units, but
internally that requires a read-modify-write operation.
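
If you want to check what a given device reports, the sector sizes are
exposed via ioctl on Linux; a quick sketch (the device path is just an
example, and opening it typically requires elevated privileges):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int
main(void)
{
	int			fd = open("/dev/nvme0n1", O_RDONLY);	/* example device */
	int			logical = 0;
	int			physical = 0;

	if (fd < 0)
		return 1;
	ioctl(fd, BLKSSZGET, &logical);		/* smallest addressable write */
	ioctl(fd, BLKPBSZGET, &physical);	/* what the media natively uses */
	printf("logical: %d, physical: %d\n", logical, physical);
	close(fd);
	return 0;
}

A logical/physical combination of 512/4096 is the read-modify-write case
described above.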

For some example numbers, I ran a very simple insert workload with a varying
number of clients against both a wal_blocksize=4096 and a wal_blocksize=8192
cluster, and measured the number of bytes written to disk before/after each
run. The table was recreated before each run, followed by a checkpoint, then
the benchmark. I ran the inserts for only 15s each, because the results don't
change meaningfully with longer runs.
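
One way to gather such numbers is to sample the kernel's per-device "sectors
written" counter before and after each run; a minimal sketch, with the device
name purely illustrative:

#include <stdio.h>

/* The 7th field of /sys/block/<dev>/stat is sectors written, always in
 * 512-byte units, independent of the device's actual sector size. */
static long long
sectors_written(const char *path)
{
	long long	v[7] = {0};
	FILE	   *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fscanf(f, "%lld %lld %lld %lld %lld %lld %lld",
			   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]) != 7)
		v[6] = -1;
	fclose(f);
	return v[6];
}

int
main(void)
{
	const char *dev = "/sys/block/nvme0n1/stat";	/* example device */
	long long	before = sectors_written(dev);

	/* ... run the benchmark here ... */

	long long	after = sectors_written(dev);

	printf("bytes written: %lld\n", (after - before) * 512);
	return 0;
}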

With XLOG_BLCKSZ=8192

clients      tps  disk bytes written
      1      667               81296
      2      739               89796
      4     1446               89208
      8     2858               90858
     16     5775               96928
     32    11920              115351
     64    23686              135244
    128    46001              173390
    256    88833              239720
    512   146208              335669

With XLOG_BLCKSZ=4096

clients      tps  disk bytes written
      1      751               46838
      2      773               47936
      4     1512               48317
      8     3143               52584
     16     6221               59097
     32    12863               73776
     64    25652               98792
    128    48274              133330
    256    88969              200720
    512   146298              298523

This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).

It's IMO quite interesting that even at the higher client counts, the number
of bytes written doesn't reach parity.

On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192

clients      tps  disk bytes written
      1    23786             2893392
      2    38515             4683336
      4    63436             4688052
      8   106618             4618760
     16   177905             4384360
     32   254890             3890664
     64   297113             3031568
    128   299878             2297808
    256   308774             1935064
    512   292515             1630408

With XLOG_BLCKSZ=4096

clients      tps  disk bytes written
      1    25742             1586748
      2    43578             2686708
      4    62734             2613856
      8   116217             2809560
     16   200802             2947580
     32   269268             2461364
     64   323195             2042196
    128   317160             1550364
    256   309601             1285744
    512   292063             1103816

It's fun to see how the total volume of writes *decreases* at higher
concurrency, because it becomes more likely that pages are filled completely
before they have to be flushed.

One thing I noticed is that our auto-configuration of wal_buffers leads to a
different wal_buffers size, in bytes, for different XLOG_BLCKSZ, which
doesn't seem great.
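
Roughly, the auto-tuning picks a *page count*, so the byte size it ends up
with scales with XLOG_BLCKSZ over most of the range. A simplified paraphrase
(see XLOGChooseNumBuffers() in xlog.c for the exact floor/cap details):

/* wal_buffers = -1: use 1/32 of shared_buffers, capped at one WAL
 * segment, with a small floor -- but counted in XLOG_BLCKSZ pages,
 * so halving the page size halves the buffer size in bytes. */
static int
choose_wal_buffers(int shared_buffers_pages, int wal_segment_size,
				   int xlog_blcksz)
{
	int			pages = shared_buffers_pages / 32;

	if (pages > wal_segment_size / xlog_blcksz)
		pages = wal_segment_size / xlog_blcksz;
	if (pages < 8)
		pages = 8;
	return pages;
}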

Performing the same COPY workload (1024 files, split across N clients) with
both settings shows no performance difference, but a very slight increase in
total bytes written (about 0.25%, which is roughly what I'd expect from the
additional page headers).

Personally I'd say the slight increase in WAL volume for bulk loads is more
than outweighed by the increase in throughput and the decrease in bytes
written for transactional workloads.

There's an alternative approach we could take: write in 4KB increments while
keeping 8KB pages. With the current format that's not obviously a bad idea.
But given that there aren't really advantages to 8KB WAL pages, it seems we
should just go for 4KB?

Greetings,

Andres Freund
