Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2022-01-21 13:17:40
Message-ID: CAA5-nLCaX4Bz2MW-gupH8Ovn+r8KrCSwLW3jiUKcPf555J=5+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for your patch, it also works well when executing our use case, the
timeout no longer appears in the logs. Is it necessary now to refine this
patch and make as few changes as possible in order for it to be released?

On Fri, Jan 21, 2022 at 10:51 AM wangw(dot)fnst(at)fujitsu(dot)com <
wangw(dot)fnst(at)fujitsu(dot)com> wrote:

> On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > It might be not reaching the actual send_keep_alive logic in
> > WalSndKeepaliveIfNecessary because of below code:
> > {
> > ...
> > /*
> > * Don't send keepalive messages if timeouts are globally disabled or
> > * we're doing something not partaking in timeouts.
> > */
> > if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> > }
> >
> > I think you can add elog before the above return and before updating
> progress
> > in the below code:
> > case REORDER_BUFFER_CHANGE_INSERT:
> > if (!relentry->pubactions.pubinsert)
> > + {
> > + OutputPluginUpdateProgress(ctx);
> > return;
> >
> > This will help us to rule out one possibility.
>
> Thanks for your advices!
>
> According to your advices, I applied 0001,0002 and 0003 to run the test
> script.
> When subscriber timeout, I filter publisher-side log:
> $ grep "before invoking update progress" pub.log | wc -l
> 60373557
> $ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log
> | wc -l
> 0
> $ grep "return because waiting_for_ping_response" pub.log | wc -l
> 0
>
> Based on this result, I think function WalSndKeepaliveIfNecessary was
> invoked,
> but function WalSndKeepalive was not invoked because (last_processing >=
> ping_time) is false.
> So I tried to see changes about last_processing and last_reply_timestamp
> (because ping_time is based on last_reply_timestamp).
>
> I found last_processing and last_reply_timestamp is set in function
> ProcessRepliesIfAny.
> last_processing is set to the time when function ProcessRepliesIfAny is
> invoked.
> Only when publisher receive a response from subscriber,
> last_reply_timestamp is
> set to last_processing and the flag waiting_for_ping_response is reset to
> false.
>
> When we are during the loop to skip all the changes of transaction, IIUC,
> we do
> not invoke function ProcessRepliesIfAny. So I think last_processing and
> last_reply_timestamp will not be changed in this loop.
> Therefore I think about our use case, we should modify the condition of
> invoking WalSndKeepalive.(please refer to
> 0004-Simple-modification-of-timing.patch, and note that this is only a
> patch
> for testing).
> At the same time I modify the input of WalSndKeepalive from true to false.
> This
> is because when input is true, waiting_for_ping_response is set to true in
> WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in
> the
> loop, so I think waiting_for_ping_response will not be reset to false and
> keepalive messages will not be sent.
>
> I tested after applying patches(0001 and 0004), I found the timeout was not
> printed in subscriber-side log. And the added messages "begin load
> changes" and
> "commit the log" were printed in publisher-side log:
> $ grep -ir "begin load changes" pub.log
> 2022-01-21 11:17:06.934 CST [2577699] LOG: begin load changes
> $ grep -ir "commit the log" pub.log
> 2022-01-21 11:21:15.564 CST [2577699] LOG: commit the log
>
> Attach the patches and test script mentioned above, in case someone wants
> to
> try.
>
> Regards,
> Wang wei
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shruthi Gowda 2022-01-21 13:40:22 Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
Previous Message Amit Kapila 2022-01-21 12:13:12 Re: Skipping logical replication transactions on subscriber side