Quick Links

Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming

From:	Zane Duffield <duffieldzane(at)gmail(dot)com>
To:	pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Date:	2025-08-11 06:20:56
Message-ID:	CACMiCkXyC4au74kvE2g6Y=mCEF8X6r-Ne_ty4r7qWkUjRE4+oQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

Hi all,

In testing out the subscription parameter "streaming" we've noticed some
unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when
using "streaming = parallel".
During the periods of unusual lag (when compared to a separate "streaming =
off" subscriber) we also see some additional logging from the (replica)
postgres server about waiting for / acquiring locks, as well as error
messages about logical apply workers exiting.

Then, today we found the logical replica stuck with a lag of about 30GB.
The logs revealed a looping lock timeout with the parallel apply workers,
the relevant section of which is attached.

While these workers were stuck, we queried the pg_locks catalog table in an
attempt to collect more information. The query itself, and the resulting
data (run with "\watch 1") are attached.

What I see in the output is that the lead apply worker (pid 2857187) is
blocking the parallel worker (pid 2857523), then later another parallel
worker (pid 2861846) is also blocking the lead apply worker (pid 2857187).
What I can't see, is anything blocking pid 2861846, and the logs don't show
much about this worker (only that it starts and then is terminated). It's
not clear to me whether this is a deadlock, or a case where we need to
increase the lock_timeout (we didn't try).

At this point, I don't have a description of the upstream load that leads
to this issue, and I don't have a reproducer script

One thing I should mention is that the schema of the publisher and
subscriber are completely identical; the subscriber was created using
pg_createsubscriber, and has not had any DDL applied (except for temp
tables) since. I know that there is a risk of deadlocks in the case of
schema differences, but there certainly isn't an extra index, constraint,
or anything like that on the subscriber.

Given the poor performance and issues we are having with the "streaming =
parallel" option, we will likely be sticking with the "streaming = on"
option, which successfully worked to reduce our logical replication lag.
I noticed that the "parallel" option is slated to become the default for
new subscriptions in PostgreSQL 18, and I wonder whether this is wise,
given the potential for locking issues; in our replication cluster, several
distinct scheduled jobs lead to the parallel streaming logical replication
falling far behind the non-streaming version.

PostgreSQL version and some relevant configuration information:

version
----------------------------------------------------------------------------------------------------------
PostgreSQL 17.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0
20240719 (Red Hat 11.5.0-5), 64-bit

shared_buffers
----------------
98172MB

work_mem
----------
6GB

max_worker_processes
----------------------
36

max_logical_replication_workers
---------------------------------
8

max_parallel_apply_workers_per_subscription
---------------------------------------------
8

Thanks,
Zane

Attachment	Content-Type	Size
pg_locks_query_results.txt	text/plain	275.9 KB
logical_apply_lock_issue.log	application/octet-stream	7.2 KB

Responses

Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming at 2025-08-11 10:55:20 from Amit Kapila
RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming at 2025-08-11 11:27:54 from Zhijie Hou (Fujitsu)

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	양주희	2025-08-11 09:09:09	Re: BUG #19014: Automatic aggressive VACUUM on template0 and template1 pg_shdepend runs every minute
Previous Message	Maximilian Bosch	2025-08-10 10:54:47	Re: BUG #18991: NUMA test fails on machine with two NUMA nodes when building 18beta2