Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2022-01-21 17:15:18
Message-ID: CAA5-nLDskH8jrshBLXv_dhAsR2aE9ofUhcbskg=SeDzercP=JA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I keep your patch 0001 and I add these two calls in function
WalSndUpdateProgress without modifying WalSndKeepaliveIfNecessary, it works
too.
What do your think of this patch?

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
TransactionId xid)
{
static TimestampTz sendTime = 0;
TimestampTz now = GetCurrentTimestamp();

ProcessRepliesIfAny();
WalSndKeepaliveIfNecessary();

/*
* Track lag no more than once per
WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS to
* avoid flooding the lag tracker when we commit frequently.
*/
...
Regards

Fabrice

On Fri, Jan 21, 2022 at 2:17 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
wrote:

> Thanks for your patch, it also works well when executing our use case, the
> timeout no longer appears in the logs. Is it necessary now to refine this
> patch and make as few changes as possible in order for it to be released?
>
> On Fri, Jan 21, 2022 at 10:51 AM wangw(dot)fnst(at)fujitsu(dot)com <
> wangw(dot)fnst(at)fujitsu(dot)com> wrote:
>
>> On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> > It might be not reaching the actual send_keep_alive logic in
>> > WalSndKeepaliveIfNecessary because of below code:
>> > {
>> > ...
>> > /*
>> > * Don't send keepalive messages if timeouts are globally disabled or
>> > * we're doing something not partaking in timeouts.
>> > */
>> > if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
>> > }
>> >
>> > I think you can add elog before the above return and before updating
>> progress
>> > in the below code:
>> > case REORDER_BUFFER_CHANGE_INSERT:
>> > if (!relentry->pubactions.pubinsert)
>> > + {
>> > + OutputPluginUpdateProgress(ctx);
>> > return;
>> >
>> > This will help us to rule out one possibility.
>>
>> Thanks for your advices!
>>
>> According to your advices, I applied 0001,0002 and 0003 to run the test
>> script.
>> When subscriber timeout, I filter publisher-side log:
>> $ grep "before invoking update progress" pub.log | wc -l
>> 60373557
>> $ grep "return because wal_sender_timeout or last_reply_timestamp"
>> pub.log | wc -l
>> 0
>> $ grep "return because waiting_for_ping_response" pub.log | wc -l
>> 0
>>
>> Based on this result, I think function WalSndKeepaliveIfNecessary was
>> invoked,
>> but function WalSndKeepalive was not invoked because (last_processing >=
>> ping_time) is false.
>> So I tried to see changes about last_processing and last_reply_timestamp
>> (because ping_time is based on last_reply_timestamp).
>>
>> I found last_processing and last_reply_timestamp is set in function
>> ProcessRepliesIfAny.
>> last_processing is set to the time when function ProcessRepliesIfAny is
>> invoked.
>> Only when publisher receive a response from subscriber,
>> last_reply_timestamp is
>> set to last_processing and the flag waiting_for_ping_response is reset to
>> false.
>>
>> When we are during the loop to skip all the changes of transaction, IIUC,
>> we do
>> not invoke function ProcessRepliesIfAny. So I think last_processing and
>> last_reply_timestamp will not be changed in this loop.
>> Therefore I think about our use case, we should modify the condition of
>> invoking WalSndKeepalive.(please refer to
>> 0004-Simple-modification-of-timing.patch, and note that this is only a
>> patch
>> for testing).
>> At the same time I modify the input of WalSndKeepalive from true to
>> false. This
>> is because when input is true, waiting_for_ping_response is set to true in
>> WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked
>> in the
>> loop, so I think waiting_for_ping_response will not be reset to false and
>> keepalive messages will not be sent.
>>
>> I tested after applying patches(0001 and 0004), I found the timeout was
>> not
>> printed in subscriber-side log. And the added messages "begin load
>> changes" and
>> "commit the log" were printed in publisher-side log:
>> $ grep -ir "begin load changes" pub.log
>> 2022-01-21 11:17:06.934 CST [2577699] LOG: begin load changes
>> $ grep -ir "commit the log" pub.log
>> 2022-01-21 11:21:15.564 CST [2577699] LOG: commit the log
>>
>> Attach the patches and test script mentioned above, in case someone wants
>> to
>> try.
>>
>> Regards,
>> Wang wei
>>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-01-21 17:19:12 Re: New developer papercut - Makefile references INSTALL
Previous Message Robert Haas 2022-01-21 16:49:12 Re: New developer papercut - Makefile references INSTALL