Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: 陈宗志 <baotiao(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write
Date: 2026-02-16 14:07:16
Message-ID: CAKZiRmyN40=WW27Mnkj_zO3FvYn8fcoFwnQ+a=+W6zymqPr0vQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 9, 2026 at 7:53 PM 陈宗志 <baotiao(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> I raised this topic a while back [1] but didn't get much traction, so
> I went ahead and implemented it: a doublewrite buffer (DWB) mechanism
> for PostgreSQL as an alternative to full_page_writes.
>
> The core argument is straightforward. FPW and checkpoint frequency are
> fundamentally at odds:
>
> - FPW wants fewer checkpoints -- each checkpoint triggers a wave of
> full-page WAL writes for every page dirtied for the first time,
> bloating WAL and tanking write throughput.
> - Fast crash recovery wants more checkpoints -- less WAL to replay
> means the database comes back sooner.
>
> DWB resolves this tension by moving torn page protection out of the
> WAL path entirely. Instead of writing full pages into WAL (foreground,
> latency-sensitive), dirty pages are sequentially written to a
> dedicated doublewrite buffer area on disk before being flushed to
> their actual locations. The buffer is fsync'd once when full, then
> pages are scatter-written to their final positions. On crash recovery,
> intact copies from the DWB repair any torn pages.
>
> Key design differences:
>
> - FPW: 1 WAL write (foreground) + 1 page write = directly impacts SQL latency
> - DWB: 2 page writes (background flush path) = minimal user-visible impact
> - DWB batches fsync() across multiple pages; WAL fsync batching is
> limited by foreground latency constraints
> - DWB decouples torn page protection from checkpoint frequency, so you
> can checkpoint as often as you want without write amplification
>
> I ran sysbench benchmarks (io-bound, --tables=10
> --table_size=10000000) with checkpoint_timeout=30s,
> shared_buffers=4GB, synchronous_commit=on. Each scenario uses a fresh
> database, VACUUM FULL, 60s warmup, 300s run.
>
> Results (TPS):
>
>                   FPW OFF    FPW ON    DWB ON
> read_write/32      18,038     7,943    13,009
> read_write/64      24,249     9,533    15,387
> read_write/128     27,801     9,715    15,387
> write_only/32      53,146    18,116    31,460
> write_only/64      57,628    19,589    32,875
> write_only/128     59,454    14,857    33,814
>
> Avg latency (ms):
>
>                   FPW OFF    FPW ON    DWB ON
> read_write/32        1.77      4.03      2.46
> read_write/64        2.64      6.71      4.16
> read_write/128       4.60     13.17      9.81
> write_only/32        0.60      1.77      1.02
> write_only/64        1.11      3.27      1.95
> write_only/128       2.15      8.61      3.78
>
> FPW ON drops to ~25% of baseline (FPW OFF). DWB ON holds at ~57%. In
> write-heavy scenarios DWB delivers over 2x the throughput of FPW with
> significantly better latency.
>
> The implementation is here: https://github.com/baotiao/postgres
>
> I'd appreciate any feedback on the approach. Would be great if the
> community could take a look and see if this direction is worth
> pursuing upstream.

Hi Baotiao

I'm a newbie here, but I took an interest in your idea; everyone else is
probably busy with other patches before the commit freeze.

I think it would be valuable to have this, as I've been hit by PostgreSQL's
unsteady (chainsaw-like) WAL traffic, especially the burst from pages being
touched for the first time after a checkpoint, up to the point of saturating
network links. The common counter-argument to double buffering is probably
that FPIs may(?) speed up WAL replay on standbys, and that would have to be
measured too (though we should also take into account how much the
maintenance_io_concurrency / posix_fadvise() prefetching we do today helps
avoid I/O stalls when fetching pages, so it may be basically free). I see
that you even got benefits by not using FPIs. Interesting.
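Just to check my understanding of the write and recovery ordering, here is
how I read the proposal, as a runnable Python sketch (the file layout, slot
mapping and torn-page check are made up for illustration; the patch itself
is C, of course):

```python
import os

BLCKSZ = 8192

def dwb_flush(datafd, dwbfd, dirty):
    """dirty: ordered list of (block_no, page_bytes).
    1) write every page sequentially into the doublewrite area,
    2) fsync the DWB file once -- the copies are now durable,
    3) scatter-write the pages to their real block offsets,
    4) fsync the data file."""
    for slot, (_blkno, page) in enumerate(dirty):
        os.pwrite(dwbfd, page, slot * BLCKSZ)
    os.fsync(dwbfd)                      # one fsync for the whole batch
    for blkno, page in dirty:
        os.pwrite(datafd, page, blkno * BLCKSZ)
    os.fsync(datafd)

def dwb_recover(datafd, dwbfd, nslots, blkno_of_slot, page_is_torn):
    """Crash recovery: replace any torn page in the data file with its
    intact copy from the DWB area (checksums would decide 'torn')."""
    repaired = []
    for slot in range(nslots):
        copy = os.pread(dwbfd, BLCKSZ, slot * BLCKSZ)
        blkno = blkno_of_slot(slot)
        if page_is_torn(os.pread(datafd, BLCKSZ, blkno * BLCKSZ)):
            os.pwrite(datafd, copy, blkno * BLCKSZ)
            repaired.append(blkno)
    return repaired
```

If that's right, the invariant that makes it safe is the fsync barrier
between steps 2 and 3: a data-file page can only become torn after its
intact copy is already durable in the DWB area.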

Some notes/questions about the patches itself:

0. The convention here is to send patches using:
git format-patch -v<VERSION> HEAD~<numberOfPatches>
for easier review. The 0003 patch probably should be out of scope. Anyway,
I've attached all of them so maybe somebody else will take a look too;
they look very mature. Is this code already used in production anywhere?
(BTW, the numbers are quite impressive.)

1. We have full_page_writes = on/off, but your patch adds double_write_buffer.
IMHO if we have competing solutions it would be better to have something like
io_torn_pages_protection = off | full_pages | double_writes
and maybe we'll be able to add 'atomic_writes' one day.
BTW: once you stabilize the GUC, it is worth adding it to postgresql.conf.sample.
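Roughly the value set I have in mind, sketched in Python just to spell out
the proposed semantics (the name io_torn_pages_protection is my suggestion,
not anything that exists today; in the server this would be a
config_enum_entry table in guc_tables.c):

```python
# Hypothetical values for a unified torn-page-protection GUC.
TORN_PAGE_PROTECTION = {
    "off": "no protection (only safe if storage writes pages atomically)",
    "full_pages": "full-page images in WAL (today's full_page_writes=on)",
    "double_writes": "double-write buffer (this proposal)",
    # "atomic_writes" reserved for future hardware/filesystem support
}

def validate_torn_page_protection(value):
    """GUC-style validation: accept known values case-insensitively,
    reject everything else with an error naming the parameter."""
    v = value.lower()
    if v not in TORN_PAGE_PROTECTION:
        raise ValueError(
            f'invalid value for io_torn_pages_protection: "{value}"')
    return v
```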

2. How would one know how to size double_write_buffer_size?

2b. IMHO the patch could enrich pg_stat_io with some information. Please take
a look at the pg_stat_io view and functions like pgstat_count_io_op_time() and
their parameters and enums; that way we could have, say, an IOOBJECT_DWBUF and
be able to tell how much I/O was attributed to double buffering, the fsync()
times related to it, and so on.
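Roughly what I mean, as a toy sketch of the accounting (the real thing would
extend the IOObject enum and the pgstat_count_io_op_time() call sites; the
names below are hypothetical):

```python
import time
from collections import defaultdict

# Toy stats table keyed by (io_object, io_op), mirroring the shape of a
# pg_stat_io row: operation count, bytes moved, accumulated time.
io_stats = defaultdict(lambda: {"count": 0, "bytes": 0, "time": 0.0})

def count_io_op_time(io_object, io_op, nbytes, start_time):
    """Toy analogue of pgstat_count_io_op_time(): attribute one I/O
    operation (e.g. ("dwbuf", "write") or ("dwbuf", "fsync")) to the
    stats table, including its wall-clock duration."""
    ent = io_stats[(io_object, io_op)]
    ent["count"] += 1
    ent["bytes"] += nbytes
    ent["time"] += time.monotonic() - start_time
```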

3. In DWBufPostCheckpoint() there's a pg_usleep(1ms) just before the pwrite()
calls, but why exactly is this sched_yield(2)-like pause necessary there?

4. In BufferSync() I have doubts whether copying like this in a loop is safe:
page = BufHdrGetBlock(bufHdr);
memcpy(dwb_buf, page, BLCKSZ);
Shouldn't there be some form of locking (BUFFER_LOCK_SHARE?) or pinning of the
buffer? Also, wouldn't it be better if that memcpy were guarded by a critical
section (START_CRIT_SECTION)?
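The pattern I have in mind, as a toy Python analogue (threads standing in for
a backend and the checkpointer; a plain mutex here where the server would use
the buffer content lock in shared/exclusive mode):

```python
import threading

PAGE_SIZE = 8192
page = bytearray(PAGE_SIZE)
content_lock = threading.Lock()  # stands in for the buffer content lock

def writer(n_updates):
    """Backend: rewrites the whole page under the (exclusive) lock."""
    for i in range(n_updates):
        with content_lock:
            page[:] = bytes([i % 256]) * PAGE_SIZE

def copy_page_locked():
    """Checkpointer: takes the lock (think BUFFER_LOCK_SHARE) before the
    memcpy-equivalent, so the copy is a consistent page image."""
    with content_lock:
        return bytes(page)

def run_demo():
    t = threading.Thread(target=writer, args=(1000,))
    t.start()
    copies = [copy_page_locked() for _ in range(100)]
    t.join()
    # every copy must be internally consistent: one uniform byte value
    return all(len(set(c)) == 1 for c in copies)
```

Without that lock a concurrent modification could leave a mixed old/new image
in dwb_buf, i.e. the DWB itself would store a torn page.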

4b. There seems to be double copying: there's a palloc for dwb_buf in
BufferSync() that is filled by memcpy(), and then DWBufWritePage() is called,
where that page is copied a second time using memcpy(). This happens for
every checkpointed page, so it may reduce the benefits of the
double-buffering code.

4c. Shouldn't the active waiting in DWBufWritePage() be implemented using
spinlocks rather than pg_usleep(100us)?
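Or, instead of a sleep-and-retry loop, block on a condition variable until a
slot frees up. A Python sketch of the idea (the server-side equivalent would
be something like a spinlock-protected counter plus ConditionVariableSleep();
the class and names here are hypothetical):

```python
import threading

class DWBSlots:
    """Toy double-write-buffer slot pool: writers block until a slot is
    free instead of polling with a 100us sleep."""
    def __init__(self, nslots):
        self.free = nslots
        self.cond = threading.Condition()

    def acquire_slot(self):
        with self.cond:
            while self.free == 0:        # no busy-wait: sleep on the CV
                self.cond.wait()
            self.free -= 1

    def release_slot(self):
        with self.cond:
            self.free += 1
            self.cond.notify()           # wake one waiting writer
```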

5. Have you verified, using injection points (or gdb), that crashing in
several places really hits DWBufRecoverPage()? Is there a simple way of
reproducing this to play with it? (Possibly that could be a good test on
its own.)

6. Quick testing overview (for completeness):
- a basic run without even enabling the feature complains about
postgresql.conf.sample (test_misc/003_check_guc)
- with `PG_TEST_INITDB_EXTRA_OPTS="-c double_write_buffer=on" meson test`
I got 3 failures:
  * test_misc/003_check_guc (expected)
  * pg_waldump/002_save_fullpage (I would say it's expected)
  * pg_walinspect / pg_walinspect/regress (I would say it's expected)

I haven't really got it up and running for real, but at least that's a
start, and I hope it helps.

-J.

Attachment Content-Type Size
v1-0003-Add-Claude-Code-configuration.patch text/x-patch 2.2 KB
v1-0001-Add-double-write-buffer-DWB-for-torn-page-protect.patch text/x-patch 26.4 KB
v1-0002-Fix-DWB-process-handling-and-skip-FPW-when-DWB-en.patch text/x-patch 10.0 KB
v1-0004-Fix-critical-correctness-bugs-in-double-write-buf.patch text/x-patch 18.9 KB
