From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Zane Duffield <duffieldzane(at)gmail(dot)com> |
Cc: | "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
Subject: | RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Date: | 2025-08-18 11:51:45 |
Message-ID: | TY4PR01MB16907BBAFB73A4022CB4121089431A@TY4PR01MB16907.jpnprd01.prod.outlook.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
> On Monday, August 18, 2025 4:12 PM Zane Duffield
> <duffieldzane(at)gmail(dot)com> wrote:
> > On Mon, Aug 11, 2025 at 9:28 PM Zhijie Hou (Fujitsu)
> > <mailto:houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > Thank you for reporting the issue. I reviewed the logs and query
> > > results but did not find any evidence of a suspicious strong lock.
> > > Therefore, if possible, could you please query pg_lock and pg_stat_activity
> separately?
> > > This will allow us to determine whether there are any other wait
> > > events, besides strong locks, that might be causing blocking.
> >
> > We tried reproducing the issue in a test environment and were not
> > completely successful; we can reproduce the unusual spikes in
> > replication lag (which correspond to restarts of the apply worker),
> > but we cannot reproduce the stalled replication that we saw in production
> once.
> >
> > I suspect that these issues are related, and that the stalling is
> > simply a more severe case of the lagging, so I hope that the extra
> > logs I have collected will help diagnose the problem. Attached is a
> > tarball (~17MB, with file extension _tgz because gmail wouldn't let me
> upload a .tgz) which contains a few files:
> > 1. confirmed_flush_lsn_lag_graph_2025_08_09.png: a graph of the logical
> > replication lag during the original incident, where replication falls behind
> > and does not recover until we intervene and alter the subscription to
> disable
> > parallel apply
> > 2. confirmed_flush_lsn_lag_graph_2025_08_13.png: the same graph, but
> during an
> > attempt to recreate the issue
> > 3. server.log: the subscriber's postgres server log during the attempt to
> > recreate the issue
> > 4. pg_stat_activity.log: a log of a periodic query of all of pg_stat_activity
> > during the attempt to recreate the issue 5. pg_locks.log: a log of
> > a periodic query of all of pg_locks during the attempt to recreate the
> > issue Please note that the graphs are in UTC, but the logs are in
> > UTC+1. Also note that some information has been redacted (server
> > names, database identifiers, IP addresses).
> >
> > Another thing worth noting is that in the first graph, the blue and
> > yellow lines correspond to two other subscribers that were configured with
> "streaming = off".
> > You can clearly see that for the period where the green line
> > ("streaming =
> > parallel") is going up and down, the other subscribers have a
> > practically flat line, only going up at the end before returning to zero.
>
> Thanks for sharing the logs!
>
> I noticed in the logs[1] that both the parallel apply worker and leader apply
> worker are restarting at certain intervals, which could be the cause of the lag.
> This might be due to a SIGINT triggered by a lock_timeout or statement_timeout,
> although it's a bit weried that there are no timeout messages present in the logs.
> If my assumption is correct, the behavior is understandable: the parallel apply
> worker waits for the leader to send more data for the streamed transaction by
> acquiring and waiting on a lock. However, the leader might be occupied with
> other transactions, preventing it from sending additional data, which could
> potentially lead to a lock timeout.
>
> To confirm this, could you please provide the values you have set for
> lock_timeout, statement_timeout (on subscriber), and
> logical_decoding_work_mem (on publisher) ?
>
> Additionally, for testing purposes, is it possible to disable these timeouts (by
> setting the lock_timeout and statement_timeout GUCs to their default values)
> in your testing environment to assess whether the lag still persists? This
> approach can help us determine whether the timeouts are causing the lag.
>
> [1]---
> Aug 13 03:22:36.259 [2242351]: [952-1] LOG: logical replication parallel apply
> worker for subscription "sub1" has finished
> ---
By the way, if possible, could you share a script that can reproduce the
same log? It will be very helpful for us to debug it.
Best Regards,
Hou zj
From | Date | Subject | |
---|---|---|---|
Next Message | PG Bug reporting form | 2025-08-18 13:32:24 | BUG #19023: Table DDL default column expression depending on temp objects |
Previous Message | Zhijie Hou (Fujitsu) | 2025-08-18 11:48:38 | RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |