From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Zane Duffield <duffieldzane(at)gmail(dot)com> |
Cc: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Date: | 2025-08-21 04:26:26 |
Message-ID: | CAA4eK1Jy5BwZmr5Sp50Q5C+jAbfRfS_e3tFyuzyyJd6CYtgncw@mail.gmail.com |
Lists: | pgsql-bugs |
On Wed, Aug 20, 2025 at 11:08 AM Zane Duffield <duffieldzane(at)gmail(dot)com> wrote:
>>
>> > On Monday, August 18, 2025 4:12 PM Zane Duffield
>> > <duffieldzane(at)gmail(dot)com> wrote:
>> > > On Mon, Aug 11, 2025 at 9:28 PM Zhijie Hou (Fujitsu)
>> > > <mailto:houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
>
> Yes, I think it is the cause of the lag (every peak lines up directly with a restart of the apply workers), but I'm not sure how it relates to the complete stall shown in confirmed_flush_lsn_lag_graph_2025_08_09.png (attached again).
>
>>
>> > This might be due to a SIGINT triggered by a lock_timeout or statement_timeout,
>> > although it's a bit weird that there are no timeout messages present in the logs.
>> > If my assumption is correct, the behavior is understandable: the parallel apply
>> > worker waits for the leader to send more data for the streamed transaction by
>> > acquiring and waiting on a lock. However, the leader might be occupied with
>> > other transactions, preventing it from sending additional data, which could
>> > potentially lead to a lock timeout.
>> >
>> > To confirm this, could you please provide the values you have set for
>> > lock_timeout, statement_timeout (on subscriber), and
>> > logical_decoding_work_mem (on publisher) ?
>
>
> lock_timeout = 30s
> statement_timeout = 4h
> logical_decoding_work_mem = 64MB
>
>>
>> >
>> > Additionally, for testing purposes, is it possible to disable these timeouts (by
>> > setting the lock_timeout and statement_timeout GUCs to their default values)
>> > in your testing environment to assess whether the lag still persists? This
>> > approach can help us determine whether the timeouts are causing the lag.
>
>
> This was a good question. See the attached confirmed_flush_lsn_lag_graph_2025_08_19.png.
> After setting lock_timeout to zero, the periodic peaks of lag were eliminated, and the restarts of the apply workers in the log are also eliminated.
>
So, this was the reason. As Hou-San explained in his previous
response, such a lock_timeout can cause the parallel apply worker to
exit while it waits for more data from the leader. I think you need to
either set lock_timeout to 0 or set it to a higher value, similar to
what you set for statement_timeout.
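
As a minimal sketch of the first option, assuming it is acceptable to
change lock_timeout for the whole subscriber instance rather than for a
single role or database (that scope is an assumption, not something
stated in this thread), the setting could be changed like this:

```sql
-- Run on the subscriber as a superuser.
-- Assumption: an instance-wide change is acceptable; sessions that set
-- lock_timeout themselves keep their own value.
SHOW lock_timeout;                  -- reported as 30s earlier in this thread

ALTER SYSTEM SET lock_timeout = 0;  -- 0 disables the timeout (the default)
SELECT pg_reload_conf();            -- reload the configuration without a restart

SHOW lock_timeout;                  -- check the value now in effect
```

To raise the limit instead of disabling it, the same statements can be
used with a larger value in place of 0.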
>
> One other thing I wonder is whether autovacuum on the subscriber has anything to do with the lock timeouts. I'm not sure whether this could explain the perpetually-restarting apply workers that we witnessed on 2025-08-09, though.
>
No, as per my understanding, it is because the parallel apply worker
was exiting due to the lock_timeout set in the test. Ideally, the patch
proposed by Kuroda-San should show in the logs that the parallel worker
is exiting due to lock_timeout. Can you try that once?
--
With Regards,
Amit Kapila.