RE: Logical replication timeout problem

From: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Fabrice Chapuis <fabrice636861(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: RE: Logical replication timeout problem
Date: 2022-03-09 02:26:14
Message-ID: OS3PR01MB62750A1360AB7DF6E8F40A909E0A9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> I've looked at the patch and have a question:
Thanks for your review and comments.

> +void
> +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> + static int skipped_changes_count = 0;
> +
> + /*
> + * skipped_changes_count is reset when processing changes that do not
> + * need to be skipped.
> + */
> + if (!skipped)
> + {
> + skipped_changes_count = 0;
> + return;
> + }
> +
> + /*
> + * After continuously skipping SKIPPED_CHANGES_THRESHOLD changes, try to send a
> + * keepalive message.
> + */
> + #define SKIPPED_CHANGES_THRESHOLD 10000
> +
> + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> + {
> + /* Try to send a keepalive message. */
> + OutputPluginUpdateProgress(ctx, true);
> +
> + /* After trying to send a keepalive message, reset the flag. */
> + skipped_changes_count = 0;
> + }
> +}
>
> Since we send a keepalive after continuously skipping 10000 changes, the
> originally reported issue can still occur if skipping 10000 changes takes longer than
> the timeout and the walsender doesn't send any change during that time, is that right?
Yes, in theory that can still happen.
However, based on my testing, I think this threshold is conservative enough that
the bug should not reproduce in practice.
Following the earlier discussion [1], the current view is that setting a
conservative fixed threshold directly is better than calculating the threshold
from wal_sender_timeout.
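
To make the trade-off concrete, here is a minimal standalone sketch of the
counter logic, not the patch itself. It assumes a mock in place of
OutputPluginUpdateProgress(ctx, true), and the names mock_send_keepalive,
keepalive_attempts, simulate_changes and worst_case_gap_ms are hypothetical
helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as in the patch; everything else here is illustrative. */
#define SKIPPED_CHANGES_THRESHOLD 10000

/* Hypothetical counter so the sketch can observe keepalive attempts. */
static int keepalive_attempts = 0;

/* Hypothetical mock standing in for OutputPluginUpdateProgress(ctx, true). */
static void
mock_send_keepalive(void)
{
	keepalive_attempts++;
}

static int skipped_changes_count = 0;

/* Mirrors the control flow of SendKeepaliveIfNecessary() without a context. */
static void
send_keepalive_if_necessary(bool skipped)
{
	/* A change that is not skipped resets the run of skipped changes. */
	if (!skipped)
	{
		skipped_changes_count = 0;
		return;
	}

	/* After a full run of skipped changes, try a keepalive and start over. */
	if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
	{
		mock_send_keepalive();
		skipped_changes_count = 0;
	}
}

/* Feed n consecutive changes; return how many keepalive attempts they caused. */
static int
simulate_changes(int n, bool skipped)
{
	int		before = keepalive_attempts;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		send_keepalive_if_necessary(skipped);
	return keepalive_attempts - before;
}

/*
 * Longest silent gap in milliseconds, assuming (hypothetically) that every
 * skipped change costs per_change_ms milliseconds to process.
 */
static double
worst_case_gap_ms(double per_change_ms)
{
	return SKIPPED_CHANGES_THRESHOLD * per_change_ms;
}
```

Under this simplified model, the gap between keepalive attempts reaches 60000 ms
(the default wal_sender_timeout of 60s) only if each skipped change takes about
6 ms on average, which is why a fixed threshold of 10000 looks conservative.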

[1] - https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei
