Quick Links

Re: Logical replication timeout problem

From:	Björn Harrtell <bjorn(dot)harrtell(at)gmail(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	"wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Fabrice Chapuis <fabrice636861(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject:	Re: Logical replication timeout problem
Date:	2022-03-14 22:23:47
Message-ID:	CANhDX=aM7sxDRWmOg6J=jb9pCOGruQ44F3nKKN4uu4EakHbMaA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi, I have been following this discussion for a while because I believe we
are hit by this pretty hard.

This sounds very reasonable to me:

"Why don't we check both the count and the time?
That is, I think we can send a keep-alive either if we skipped 10000
changes or if we didn't sent anything for wal_sender_timeout / 2"

Will gladly test what ends up as an acceptable patch for this, hoping for
the best and thanks for looking into this.

Den ons 9 mars 2022 kl 07:45 skrev Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>:

> On Wed, Mar 9, 2022 at 11:26 AM wangw(dot)fnst(at)fujitsu(dot)com
> <wangw(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> wrote:
> > > I've looked at the patch and have a question:
> > Thanks for your review and comments.
> >
> > > +void
> > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > > + static int skipped_changes_count = 0;
> > > +
> > > + /*
> > > + * skipped_changes_count is reset when processing changes
> that do not
> > > + * need to be skipped.
> > > + */
> > > + if (!skipped)
> > > + {
> > > + skipped_changes_count = 0;
> > > + return;
> > > + }
> > > +
> > > + /*
> > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > > changes, try to send a
> > > + * keepalive message.
> > > + */
> > > + #define SKIPPED_CHANGES_THRESHOLD 10000
> > > +
> > > + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > > + {
> > > + /* Try to send a keepalive message. */
> > > + OutputPluginUpdateProgress(ctx, true);
> > > +
> > > + /* After trying to send a keepalive message, reset
> the flag. */
> > > + skipped_changes_count = 0;
> > > + }
> > > +}
> > >
> > > Since we send a keepalive after continuously skipping 10000 changes,
> the
> > > originally reported issue can still occur if skipping 10000 changes
> took more than
> > > the timeout and the walsender didn't send any change while that, is
> that right?
> > Yes, theoretically so.
> > But after testing, I think this value should be conservative enough not
> to reproduce
> > this bug.
>
> But it really depends on the workload, the server condition, and the
> timeout value, right? The logical decoding might involve disk I/O much
> to spill/load intermediate data and the system might be under the
> high-load condition. Why don't we check both the count and the time?
> That is, I think we can send a keep-alive either if we skipped 10000
> changes or if we didn't sent anything for wal_sender_timeout / 2.
>
> Also, the patch changes the current behavior of wal senders; with the
> patch, we send keep-alive messages even when wal_sender_timeout = 0.
> But I'm not sure it's a good idea. The subscriber's
> wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
> I think it's better to periodically check replies and send a reply to
> the keep-alive message sent from the subscriber if necessary, for
> example, every 10000 skipped changes.
>
> Regards,
>
> --
> Masahiko Sawada
> EDB: https://www.enterprisedb.com/
>
>
>

In response to

Re: Logical replication timeout problem at 2022-03-09 06:45:01 from Masahiko Sawada

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Kekalainen, Otto	2022-03-14 23:03:50	[PATCH] Fix various spelling errors
Previous Message	Imseih (AWS), Sami	2022-03-14 22:21:50	Re: Add index scan progress to pg_stat_progress_vacuum