Quick Links

RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	Zane Duffield <duffieldzane(at)gmail(dot)com>
Cc:	"pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject:	RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Date:	2025-08-18 11:48:38
Message-ID:	TY4PR01MB16907007071B97B31420679B19431A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Monday, August 18, 2025 4:12 PM Zane Duffield <duffieldzane(at)gmail(dot)com> wrote:
> On Mon, Aug 11, 2025 at 9:28 PM Zhijie Hou (Fujitsu)
> <mailto:houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > Thank you for reporting the issue. I reviewed the logs and query results but
> > did not find any evidence of a suspicious strong lock. Therefore, if
> > possible, could you please query pg_lock and pg_stat_activity separately?
> > This will allow us to determine whether there are any other wait events,
> > besides strong locks, that might be causing blocking.
>
> We tried reproducing the issue in a test environment and were not completely
> successful; we can reproduce the unusual spikes in replication lag (which
> correspond to restarts of the apply worker), but we cannot reproduce the stalled
> replication that we saw in production once.
>
> I suspect that these issues are related, and that the stalling is simply a more
> severe case of the lagging, so I hope that the extra logs I have collected will
> help diagnose the problem. Attached is a tarball (~17MB, with file extension
> _tgz because gmail wouldn't let me upload a .tgz) which contains a few files:
> 1. confirmed_flush_lsn_lag_graph_2025_08_09.png: a graph of the logical
> replication lag during the original incident, where replication falls behind
> and does not recover until we intervene and alter the subscription to disable
> parallel apply
> 2. confirmed_flush_lsn_lag_graph_2025_08_13.png: the same graph, but during an
> attempt to recreate the issue
> 3. server.log: the subscriber's postgres server log during the attempt to
> recreate the issue
> 4. pg_stat_activity.log: a log of a periodic query of all of pg_stat_activity
> during the attempt to recreate the issue
> 5. pg_locks.log: a log of a periodic query of all of pg_locks during the attempt
> to recreate the issue Please note that the graphs are in UTC, but the logs are
> in UTC+1. Also note that some information has been redacted (server names,
> database identifiers, IP addresses).
>
> Another thing worth noting is that in the first graph, the blue and yellow lines
> correspond to two other subscribers that were configured with "streaming = off".
> You can clearly see that for the period where the green line ("streaming =
> parallel") is going up and down, the other subscribers have a practically flat
> line, only going up at the end before returning to zero.

Thanks for sharing the logs!

I noticed in the logs[1] that both the parallel apply worker and leader apply
worker are restarting at certain intervals, which could be the cause of the lag.
This might be due to a SIGINT triggered by a lock_timeout or statement_timeout,
although it's a bit weried that there are no timeout messages present in the
logs. If my assumption is correct, the behavior is understandable: the parallel
apply worker waits for the leader to send more data for the streamed transaction
by acquiring and waiting on a lock. However, the leader might be occupied with
other transactions, preventing it from sending additional data, which could
potentially lead to a lock timeout.

To confirm this, could you please provide the values you have set for
lock_timeout, statement_timeout (on subscriber), and logical_decoding_work_mem
(on publisher) ?

Additionally, for testing purposes, is it possible to disable these timeouts (by
setting the lock_timeout and statement_timeout GUCs to their default values) in
your testing environment to assess whether the lag still persists? This approach
can help us determine whether the timeouts are causing the lag.

[1]---
Aug 13 03:22:36.259 [2242351]: [952-1] LOG: logical replication parallel apply worker for subscription "sub1" has finished
---

Best Regards,
Hou zj

In response to

RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming at 2025-08-11 11:27:54 from Zhijie Hou (Fujitsu)

Responses

RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming at 2025-08-18 11:51:45 from Zhijie Hou (Fujitsu)

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Zhijie Hou (Fujitsu)	2025-08-18 11:51:45	RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Previous Message	Amit Kapila	2025-08-18 09:26:17	Re: BUG #18988: DROP SUBSCRIPTION locks not-yet-accessed database