Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Zane Duffield <duffieldzane(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Date: 2025-08-11 10:55:20
Message-ID: CAA4eK1Kn_B4UUiFMnZ6G50tRz2PRD7XSX8_B3CKkaMTXqtiwoQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Aug 11, 2025 at 11:51 AM Zane Duffield <duffieldzane(at)gmail(dot)com> wrote:
>
> In testing out the subscription parameter "streaming" we've noticed some unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when using "streaming = parallel".
> During the periods of unusual lag (when compared to a separate "streaming = off" subscriber) we also see some additional logging from the (replica) postgres server about waiting for / acquiring locks, as well as error messages about logical apply workers exiting.
>
> Then, today we found the logical replica stuck with a lag of about 30GB. The logs revealed a looping lock timeout with the parallel apply workers, the relevant section of which is attached.
>
> While these workers were stuck, we queried the pg_locks catalog table in an attempt to collect more information. The query itself, and the resulting data (run with "\watch 1") are attached.
>
> What I see in the output is that the lead apply worker (pid 2857187) is blocking the parallel worker (pid 2857523), then later another parallel worker (pid 2861846) is also blocking the lead apply worker (pid 2857187).
> What I can't see, is anything blocking pid 2861846, and the logs don't show much about this worker (only that it starts and then is terminated). It's not clear to me whether this is a deadlock, or a case where we need to increase the lock_timeout (we didn't try).
>

It is not the case of deadlock. Increasing lock_timeout could only
help if the parallel worker (2861846) is slow but still processing the
changes but if it is stuck in some other wait like LWLock then it
won't help. The key part of the puzzle is what exactly the process
2861846 is busy with? You can either share more server LOGs for that
or it would be better to increase the LOG level to DEBUG-1/2 to see if
we can get some more information about process 2861846.

The other wait you see for worker 2857523 is normal as it is waiting
for the next set of changes from the leader worker after processing
its current set.

> At this point, I don't have a description of the upstream load that leads to this issue, and I don't have a reproducer script
>

Please see, if you can get that because the current set of information
is not sufficient to establish that it is the problem of parallel
apply.

> One thing I should mention is that the schema of the publisher and subscriber are completely identical; the subscriber was created using pg_createsubscriber, and has not had any DDL applied (except for temp tables) since. I know that there is a risk of deadlocks in the case of schema differences, but there certainly isn't an extra index, constraint, or anything like that on the subscriber.
>
> Given the poor performance and issues we are having with the "streaming = parallel" option, we will likely be sticking with the "streaming = on" option, which successfully worked to reduce our logical replication lag.
> I noticed that the "parallel" option is slated to become the default for new subscriptions in PostgreSQL 18, and I wonder whether this is wise, given the potential for locking issues; in our replication cluster, several distinct scheduled jobs lead to the parallel streaming logical replication falling far behind the non-streaming version.
>

It is worth considering but we should at least first figure out what
is going wrong. It could be some minor bug either in parallel-apply
infrastructure or a pre-existing one which is exposed due to
parallel-apply activity.

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Zhijie Hou (Fujitsu) 2025-08-11 11:27:54 RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Previous Message 양주희 2025-08-11 09:09:09 Re: BUG #19014: Automatic aggressive VACUUM on template0 and template1 pg_shdepend runs every minute