Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
Date: 2022-09-01 12:11:44
Message-ID: CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com
Lists: pgsql-hackers

Hi,

Walsenders currently read WAL data from disk to send it to replicas
(standbys connected via streaming replication, or subscribers
connected via logical replication). This means that walsenders have to
wait until the WAL data is flushed to disk. There are a few issues
with this approach:

1. IO saturation on the primary. The amount of read IO required by all
walsenders combined can be huge, given the sheer number of walsenders
typically present at any given time in production environments (e.g.
for high availability, disaster recovery, read replicas or
subscribers), and walsenders are usually long-lived (replicas are
maintained for long periods in production). For example, a quick 30
minute pgbench run with 1 primary, 1 async standby, and 1 sync standby
shows about 35 GB of WAL being read from disk on the primary, across
roughly 3.3 million read calls by the 2 walsenders [3].
2. Increased query response times, particularly with synchronous
standbys, because the WAL flushes on the primary and on the standbys
usually happen at different times.
3. Increased replication lag, especially when WAL data is read from
disk even though it is present in wal_buffers at the time (a quick way
to observe how far WAL insertion runs ahead of the flush is sketched
right after this list).
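
As a side note, the gap the proposal wants to exploit is easy to
observe on any primary: WAL that has been inserted into wal_buffers
but not yet flushed is exactly the range between
pg_current_wal_insert_lsn() and pg_current_wal_flush_lsn(). Below is a
minimal libpq sketch that samples that gap (the connection string is
only an assumption for a local setup); today a walsender cannot send
anything in that range.

/*
 * Minimal illustration: sample how far WAL insertion runs ahead of the
 * WAL flush on the primary.  Everything between the two LSNs is WAL that
 * is already in wal_buffers but that walsenders today cannot ship,
 * because they wait for the flush.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* connection string is an assumption; adjust for your setup */
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    res = PQexec(conn,
                 "SELECT pg_current_wal_insert_lsn(), "
                 "       pg_current_wal_flush_lsn()");

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("insert_lsn = %s, flush_lsn = %s\n",
               PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));

    PQclear(res);
    PQfinish(conn);
    return EXIT_SUCCESS;
}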

To address these issues, I propose letting walsenders, whenever
possible, send WAL directly from wal_buffers to replicas before it is
flushed to disk (a simplified model of this read path is sketched
after the list of advantages below). This idea is also noted elsewhere
[1]. Standbys can choose to store the received WAL in wal_buffers
(note that wal_buffers on standbys are allocated but not used until
promotion) and flush when they are full, OR store the WAL directly to
disk, bypassing wal_buffers, but replay only up to the flush LSN sent
by the primary. Logical subscribers can choose not to apply WAL beyond
the flush LSN sent by the primary. This approach has the following
advantages:

1. Reduces disk IO or read system calls on the primary.
2. Reduces replication lag.
3. Enables better use of allocated wal_buffers on the standbys.
4. Enables parallel flushing of WAL to disks on both primary and standbys.
5. Prevents async standbys or subscribers from getting ahead of sync
standbys, as discussed in the thread at [1], reducing the effort
required during failovers.
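
To make the intended read path concrete, here is a deliberately
simplified, self-contained C model. This is not PostgreSQL code: there
is no locking, and the recycled-page check only stands in for whatever
validation the real patch would need. The reader first tries the
in-memory pages and falls back to a disk read only when the requested
range has already been overwritten by newer insertions.

/*
 * Standalone model of the proposed read path, not PostgreSQL code: WAL is
 * copied into a fixed ring of pages indexed by LSN; a reader first tries
 * the ring and falls back to disk only when the requested range has been
 * evicted by newer insertions.  No locking is shown.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   8192
#define NUM_PAGES   64              /* stand-in for wal_buffers */

typedef uint64_t XLogRecPtr;

static char       pages[NUM_PAGES][PAGE_SIZE];
static XLogRecPtr page_start[NUM_PAGES];    /* LSN of the first byte in each slot */

/* Try to copy [start, start + count) out of the in-memory pages. */
static bool
read_from_buffers(XLogRecPtr start, char *dst, size_t count)
{
    while (count > 0)
    {
        int     slot = (start / PAGE_SIZE) % NUM_PAGES;
        size_t  off = start % PAGE_SIZE;
        size_t  n = PAGE_SIZE - off;

        if (n > count)
            n = count;

        /* Slot holds a different page: it was recycled, go to disk. */
        if (page_start[slot] != start - off)
            return false;

        memcpy(dst, pages[slot] + off, n);

        /* Re-check after the copy in case the slot was recycled meanwhile. */
        if (page_start[slot] != start - off)
            return false;

        dst += n;
        start += n;
        count -= n;
    }
    return true;
}

/* Walsender-style read: prefer the in-memory pages, else read pg_wal. */
static void
wal_read(XLogRecPtr start, char *dst, size_t count)
{
    if (read_from_buffers(start, dst, count))
        return;                     /* served from memory, no read syscall */

    fprintf(stderr, "falling back to disk for %zu bytes at %llu\n",
            count, (unsigned long long) start);
    /* ... pread() on the WAL segment file would go here ... */
}

int
main(void)
{
    char    buf[100];

    /* pretend the inserter filled the very first page */
    page_start[0] = 0;
    memset(pages[0], 'x', PAGE_SIZE);

    wal_read(0, buf, sizeof(buf));
    printf("first byte: %c\n", buf[0]);
    return 0;
}

The real change would of course live in the walsender's WAL read
routine and would have to coordinate with concurrent WAL insertions;
the point of the model is only the "try memory first, fall back to
disk" shape and the eviction check.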

This approach has a couple of challenges:

1. Increases stress on wal_buffers - right now there are no readers
for wal_buffers on the primary. This could be problematic if there are
both many concurrent readers and concurrent writers.
2. wal_buffers hit ratio can be low for write-heavy workloads. In this
case disk reads are inevitable.
3. Requires a change to the replication protocol. We might have to
send the primary's flush LSN to replicas and receive their flush LSNs
as acknowledgements (a hypothetical message layout is sketched after
this list).
4. Requires careful design so that replicas do not replay beyond the
received flush LSN. For example, what happens if wal_buffers fills up
on a standby - should it write the WAL to disk? What happens if the
primary or a replica crashes - will the replicas have to fetch again
the unwritten WAL that was lost from wal_buffers?
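
For challenge 3, just to illustrate what I have in mind, the following
is a hypothetical layout (the struct and field names are made up, not
the existing 'w' message) where each WAL payload carries the primary's
current flush LSN, and the receiving side never hands WAL beyond that
LSN to replay or apply:

/*
 * Hypothetical wire format, not the existing protocol message: the WAL
 * payload may now extend beyond what the primary has flushed, so the
 * message carries the flush LSN explicitly and the receiver limits replay
 * (or logical apply) to it.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct WalDataMessage
{
    XLogRecPtr  dataStart;      /* LSN of the first byte of the payload */
    XLogRecPtr  senderWrite;    /* how far the primary has written WAL */
    XLogRecPtr  senderFlush;    /* how far the primary has flushed WAL */
    int64_t     sendTime;       /* send timestamp */
    /* WAL payload follows; it may extend beyond senderFlush */
} WalDataMessage;

/*
 * Receiver-side rule: replay (or apply) only what the primary has durably
 * flushed, even if more WAL has been received into memory.
 */
static XLogRecPtr
replay_limit(XLogRecPtr receivedUpTo, XLogRecPtr senderFlush)
{
    return (receivedUpTo < senderFlush) ? receivedUpTo : senderFlush;
}

int
main(void)
{
    /* received up to 0x2000 in memory, but the primary flushed only 0x1800 */
    printf("replay up to %llX\n",
           (unsigned long long) replay_limit(0x2000, 0x1800));
    return 0;
}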

I would like to split the whole work into the following 3 independent
items and focus on each of them individually:

1. Allow walsenders to read WAL directly from wal_buffers when
possible - initial patches and results will be posted soon. This has
its own advantages; the comment at [2] talks about them.
2. Allow WAL writes and flushes to disk to happen at nearly the same
time on both the primary and the standbys.
3. Prevent async standbys or subscribers from getting ahead of sync
standbys.

Thoughts?

[1] https://www.postgresql.org/message-id/20220309020123.sneaoijlg3rszvst%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlogreader.c;h=f17e80948d17ff0e2e92fd1677d1a0da06778fc7;hb=7fed801135bae14d63b11ee4a10f6083767046d8#l1457
[3] shared_buffers = 8GB
max_wal_size = 32GB
checkpoint_timeout = 15min
track_wal_io_timing = on
wal_buffers = 16MB (auto-tuned value, not manually set)
Ubuntu VM: c5.4xlarge - AWS EC2 instance
RAM: 32GB
VCores: 16
SSD: 512GB

./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=1800 --username=ubuntu postgres

-[ RECORD 1 ]------------+---------------
application_name | async_standby1
wal_read | 1685714
wal_read_bytes | 17726209880
wal_read_time | 7746.622
-[ RECORD 2 ]------------+---------------
application_name | sync_standby1
wal_read | 1685771
wal_read_bytes | 17726209880
wal_read_time | 6002.679

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
