Report reorder buffer size

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Report reorder buffer size
Date: 2025-08-13 12:39:47
Message-ID: CAExHW5sX4a4GEMo99wcPALHPG9XM+-3ovAyPJ=ZnZAA=ux_EVw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,
The question of how to tune logical_decoding_work_mem has come up a
few times in the context of customer cases. If it is too low then the
changes are spilled to disk, slowing down replication and consuming
disk space. In case of streamed transactions, it won't consume space
on primary but will consume resources (transactional resources/disk
space) on replica. If decoding memory is too high, it may cause memory
pressure especially when there are many wal senders and large amounts
of changes queued to reorder buffers frequently.

In order to tune logical_decoding_work_mem optimally we need a measure
of the size of changes belonging to the transactions queued in the
reorder buffer, whether spilled to the disk or in memory or sent
downstream as part of an unfinished streamed transaction. Currently
there is no way to determine that. pg_stat_replication_slots reports
cumulative statistics about spilled or streamed transactions. It helps
to know whether the logical decoding work memory is sufficient or not
to hold all the changes. But it does not provide a way to estimate an
optimal value for it since it doesn't capture the state of the reorder
buffer at a given point in time.

With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached. In the plots, the Y-axis is the total size of
reorder buffer and the X axis is time of day.
rbsize_trendlines_pgbench.png shows trendline when just regular
pgbench is run with 3K TPS and
rbsize_trendlines_large_transaction_spikes.png shows the trendline
when there are long and large transactions in addition to pgbench.
Using these trendlines, a user may decide to keep
logical_decoding_work_mem to minimum and thus let all the large
transactions spill to disk or be streamed. Or they may decide to set
it as 100MB, 200MB or 500MB reducing or eliminating the disk spill or
streaming. It makes it easy to quantify the trade-offs in tuning
logical_decoding_work_mem.

Tracking total size of reorder buffer also helps to quantify load on a
wal sender. It is an additional metric that can be useful to debug WAL
buildup, slow replication etc. For example, the plot clearly and
directly shows when the wal sender started processing the large
transaction, when it finished etc.

The patch is not complete. There are many things to be worked on
1. Whether the metrics should be reported in pg_stat_replication or
pg_stat_replication_slots. Given that it's a point-in-time metric, it
fits pg_stat_replication better. But that view does not report logical
decoding activity happening in backends other than the wal senders.
But that's true with the other metrics in that view as well.
pg_stat_replication_slots, however, covers all users of logical
decoding. For now I have used pg_stat_replication, but it can be
changed based on the discussion here.

2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer? If we are going
to report so much statistics about the contents of the reorder buffer,
is it better to have a separate view pg_stat_reorder_buffer for the
same?

3. The patch doesn't cover all the places where the reorder buffer's
total size should be maintained. I need to find all such places and
add required maths.

But before I work on those, I would like to know others' opinions,
usefulness and acceptance. FWIW, attached is a tap test to test the
changes

--
Best Wishes,
Ashutosh Bapat

Attachment Content-Type Size
image/png 42.9 KB
image/png 58.1 KB
0001-Report-size-of-reorder-buffer-contents-20250813.patch text/x-patch 14.0 KB
049_rb_total_size.pl application/x-perl 7.1 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2025-08-13 13:25:27 Re: meson vs. llvm bitcode files
Previous Message Tomas Vondra 2025-08-13 12:15:37 Re: index prefetching