From: | Zane Duffield <duffieldzane(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Date: | 2025-08-11 06:20:56 |
Message-ID: | CACMiCkXyC4au74kvE2g6Y=mCEF8X6r-Ne_ty4r7qWkUjRE4+oQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
Hi all,
In testing out the subscription parameter "streaming" we've noticed some
unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when
using "streaming = parallel".
During the periods of unusual lag (when compared to a separate "streaming =
off" subscriber) we also see some additional logging from the (replica)
postgres server about waiting for / acquiring locks, as well as error
messages about logical apply workers exiting.
Then, today we found the logical replica stuck with a lag of about 30GB.
The logs revealed a looping lock timeout with the parallel apply workers,
the relevant section of which is attached.
While these workers were stuck, we queried the pg_locks catalog table in an
attempt to collect more information. The query itself, and the resulting
data (run with "\watch 1") are attached.
What I see in the output is that the lead apply worker (pid 2857187) is
blocking the parallel worker (pid 2857523), then later another parallel
worker (pid 2861846) is also blocking the lead apply worker (pid 2857187).
What I can't see, is anything blocking pid 2861846, and the logs don't show
much about this worker (only that it starts and then is terminated). It's
not clear to me whether this is a deadlock, or a case where we need to
increase the lock_timeout (we didn't try).
At this point, I don't have a description of the upstream load that leads
to this issue, and I don't have a reproducer script
One thing I should mention is that the schema of the publisher and
subscriber are completely identical; the subscriber was created using
pg_createsubscriber, and has not had any DDL applied (except for temp
tables) since. I know that there is a risk of deadlocks in the case of
schema differences, but there certainly isn't an extra index, constraint,
or anything like that on the subscriber.
Given the poor performance and issues we are having with the "streaming =
parallel" option, we will likely be sticking with the "streaming = on"
option, which successfully worked to reduce our logical replication lag.
I noticed that the "parallel" option is slated to become the default for
new subscriptions in PostgreSQL 18, and I wonder whether this is wise,
given the potential for locking issues; in our replication cluster, several
distinct scheduled jobs lead to the parallel streaming logical replication
falling far behind the non-streaming version.
PostgreSQL version and some relevant configuration information:
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 17.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0
20240719 (Red Hat 11.5.0-5), 64-bit
shared_buffers
----------------
98172MB
work_mem
----------
6GB
max_worker_processes
----------------------
36
max_logical_replication_workers
---------------------------------
8
max_parallel_apply_workers_per_subscription
---------------------------------------------
8
Thanks,
Zane
Attachment | Content-Type | Size |
---|---|---|
pg_locks_query_results.txt | text/plain | 275.9 KB |
logical_apply_lock_issue.log | application/octet-stream | 7.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | 양주희 | 2025-08-11 09:09:09 | Re: BUG #19014: Automatic aggressive VACUUM on template0 and template1 pg_shdepend runs every minute |
Previous Message | Maximilian Bosch | 2025-08-10 10:54:47 | Re: BUG #18991: NUMA test fails on machine with two NUMA nodes when building 18beta2 |