From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: High CPU consumption in cascade replication with large number of walsenders
Date: 2025-08-30 23:47:40
Message-ID: 77d94649-e00c-4d56-b2e2-e9d1843131d7@postgrespro.ru
Lists: pgsql-hackers
Hello hackers,
This is a continuation of the thread
https://www.postgresql.org/message-id/flat/076eb7bd-52e6-4a51-ba00-c744d027b15c%40postgrespro.ru,
focusing only on the patch that improves performance in the case of a
large number of cascaded walsenders.
We've encountered an interesting situation in a standby environment
with cascade replication and a large number (~100) of configured
walsenders. We noticed very high CPU consumption in this environment,
with the most time-consuming operation being signal delivery from the
startup recovery process to walsenders via WalSndWakeup invocations
from ApplyWalRecord in xlogrecovery.c.
The startup process on the standby notifies walsenders for downstream
systems using ConditionVariableBroadcast (CV), so only processes
waiting on this CV need to be contacted. Under high load, however, we
seem to hit a bottleneck here anyway. The current implementation sends
a notification after processing each WAL record (i.e., during each
invocation of ApplyWalRecord), which implies a high rate of
WalSndWakeup invocations. At the same time, this gives each walsender
only a very small chunk of data to process, so almost every process
re-enters the CV wait list before the next iteration. As a result, the
wait list tends to stay fully packed, which further reduces the rate at
which the standby instance can process WAL records.
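For reference, the current code path looks roughly like the sketch
below (paraphrased from xlogrecovery.c and walsender.c in recent
releases; surrounding details vary by version):

    /* xlogrecovery.c: at the end of ApplyWalRecord(), once per record */
    if (AllowCascadeReplication())
        WalSndWakeup(switchedTLI, true);

    /* walsender.c */
    void
    WalSndWakeup(bool physical, bool logical)
    {
        /*
         * Wake all walsenders waiting for WAL to be flushed or replayed.
         * Each broadcast walks the CV wait list, so with ~100 walsenders
         * parked on the list this cost is paid for every applied record.
         */
        if (physical)
            ConditionVariableBroadcast(&WalSndCtl->wal_flush_cv);
        if (logical)
            ConditionVariableBroadcast(&WalSndCtl->wal_replay_cv);
    }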
To reproduce this behavior we can use a simple environment with three
servers: a primary instance, an attached physical standby, and a
downstream server with a large number of logical replication
subscriptions. Attached is a synthetic test case (test_scenario.zip):
the 'test_prepare.sh' script creates the required environment with test
data, and the 'test_execute.sh' script runs 'pgbench' with simple
updates against the primary instance to trigger replication to the
other servers. With only about 6 clients I could observe high CPU
consumption by the 'startup recovering' process (and it may be enough
to completely saturate the CPU on a smaller machine). Please check the
environment properties at the top of these scripts before running them,
as they need to be updated to specify the location of the installed PG
build, the target location for the database instances, and the ports to
use.
After thinking about possible ways to improve this case, we've decided
to implement batching for notification delivery: we slightly postpone
sending the notification until recovery has applied a certain number of
records. This reduces the rate of CV notifications and also gives
receivers more data to process, so they may not need to enter the CV
wait state as often. Counting applied records is not difficult, but the
tricky part is to ensure that we do not postpone notifications for too
long under low load. To bound the delay we use a timer handler, which
sets a timeout flag that is checked in ProcessStartupProcInterrupts.
This allows us to send the signal on timeout if the startup process is
waiting for new WAL records to arrive (in ReadRecord). WalSndWakeup is
thus invoked either after a certain number of records have been applied
or after a timeout has expired since the last notification. The
notification may still be delayed while a record is being applied
(during the redo handler invocation from ApplyWalRecord). This could
increase the delay for some corner cases with non-trivial WAL records
like 'drop database', but such records are rare, and walsender
processes have their own limit on the wait time, so the delay won't be
indefinite even in this case.
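In outline, the idea looks like the sketch below (names are
hypothetical and details differ from the actual patch; the timer would
be re-armed through the usual timeout infrastructure, e.g.
enable_timeout_after()):

    /* records applied since the last wakeup */
    static int batched_records = 0;
    /* set by the timer handler, checked in ProcessStartupProcInterrupts */
    static volatile sig_atomic_t wakeup_timeout_pending = false;

    static void
    CascadeWakeupTimeoutHandler(void)
    {
        wakeup_timeout_pending = true;
    }

    /* called after each applied record instead of an unconditional wakeup */
    static void
    MaybeWalSndWakeup(bool switchedTLI)
    {
        if (cascade_replication_batch_size <= 0)
        {
            /* batching disabled: preserve the old per-record behavior */
            WalSndWakeup(switchedTLI, true);
            return;
        }

        if (++batched_records >= cascade_replication_batch_size ||
            wakeup_timeout_pending)
        {
            WalSndWakeup(switchedTLI, true);
            batched_records = 0;
            wakeup_timeout_pending = false;
            /* re-arm the batch-delay timer for the next batch here */
        }
    }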
The patch introduces two GUCs to control the batching behavior. The
first one controls the number of records per batch
('cascade_replication_batch_size') and is set to 0 by default, so the
functionality is effectively disabled. The second one controls the
maximum notification delay while batching
('cascade_replication_batch_delay'), which defaults to 500ms. The delay
is used only if batching is enabled.
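For example, to enable batching on the intermediate standby, one could
set something like the following (values are illustrative only):

    # postgresql.conf on the standby feeding cascaded walsenders
    cascade_replication_batch_size = 100      # wake walsenders every 100 applied records
    cascade_replication_batch_delay = 500ms   # or at latest 500ms after the last wakeup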
With this patch applied we've observed a significant reduction in CPU
consumption when running the synthetic test mentioned above. It would
be great to hear any thoughts on these observations and the fixing
approach, as well as possible pitfalls of the proposed changes.
Thanks,
Alexey
Attachments:
0001-Implement-batching-for-WAL-records-notification-duri.patch (text/x-patch, 9.0 KB)
test_scenario.zip (application/zip, 2.5 KB)