RE: Logical replication timeout problem

From: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Logical replication timeout problem
Date: 2022-02-28 07:40:51
Message-ID: OS3PR01MB6275FEB9F83081F1C87539B99E019@OS3PR01MB6275.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 22, 2022 at 4:56 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
Thanks for your review.

> On Tue, Feb 22, 2022 at 9:17 AM wangw(dot)fnst(at)fujitsu(dot)com
> <wangw(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Fri, Feb 18, 2022 at 10:51 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> > > Some comments:
> > Thanks for your review.
> >
> > > I see you only track skipped Inserts/Updates and Deletes. What about
> > > DDL operations that are skipped, what about truncate.
> > > What about changes made to unpublished tables? I wonder if you could
> > > create a test script that only did DDL operations
> > > and truncates, would this timeout happen?
> > According to your suggestion, I tested with DDL and truncate.
> > While testing, I ran only 20,000 DDLs and 10,000 truncations in one
> > transaction.
> > If I set wal_sender_timeout and wal_receiver_timeout to 30s, it will time out.
> > And if I use the default values, it will not time out.
> > IMHO there should not be long transactions that only contain DDL and
> > truncation. I'm not quite sure, do we need to handle this kind of use case?
> >
>
> I think it is better to handle such cases as well and changes related
> to unpublished tables as well. BTW, it seems Kuroda-San has also given
> some comments [1] which I am not sure are addressed.
Add handling of related use cases.

> I think instead of keeping the skipping threshold w.r.t
> wal_sender_timeout, we can use some conservative number like 10000 or
> so which we are sure won't impact performance and won't lead to
> timeouts.
Yes, it would be better. Set the threshold conservatively to 10000.

> *
> + /*
> + * skipped_changes_count is reset when processing changes that do not need
> to
> + * be skipped.
> + */
> + skipped_changes_count = 0
>
> When the skipped_changes_count is reset, the sendTime should also be
> reset. Can we reset it whenever the UpdateProgress function is called
> with send_keep_alive as false?
Fixed.

Attached a new patch that addresses following improvements I have got so far as
comments:
1. Consider other changes that need to be skipped(truncate, DDL and function
calls pg_logical_emit_message). [suggestion by Ajin, Amit]
(Add a new function SendKeepaliveIfNecessary for trying to send keepalive message.)
2. Set the threshold conservatively to a static value of 10000.[suggestion by Amit, Kuroda-San]
3. Reset sendTime in function WalSndUpdateProgress when send_keep_alive is
false. [suggestion by Amit]

Regards,
Wang wei

Attachment Content-Type Size
0001-Fix-the-timeout-of-subscriber-in-long-transactions.patch application/octet-stream 12.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message wangw.fnst@fujitsu.com 2022-02-28 07:42:34 RE: Logical replication timeout problem
Previous Message Julien Rouhaud 2022-02-28 07:26:07 Re: support for MERGE