RE: Logical replication timeout problem

From: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
Cc: "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
Subject: RE: Logical replication timeout problem
Date: 2022-02-24 08:06:29
Message-ID: TYAPR01MB586664ABD1C699C643CA032DF53D9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Wang,

Thank you for teaching some backgrounds about the patch.

> According to our discussion, we need to send keepalive messages to subscriber
> when skipping changes.
> One approach is that **for each skipped change**, we try to send keepalive
> message by calculating whether a timeout will occur based on the current time
> and the last time the keepalive was sent. But this will brings slight overhead.
> So I want to try another approach: after **constantly skipping some changes**,
> we try to send keepalive message by calculating whether a timeout will occur
> based on the current time and the last time the keepalive was sent.

You meant that calling system calls like GetCurrentTimestamp() should be reduced,
right? I'm not sure how it affects but it seems reasonable.

> IMO, we should send keepalive message after skipping a certain number of
> changes constantly.
> And I want to calculate the threshold dynamically by using a fixed value to
> avoid adding too much code.
> In addition, different users have different machine performance, and users can
> modify wal_sender_timeout, so the threshold should be dynamically calculated
> according to wal_sender_timeout.

Your experiment seems not bad, but the background cannot be understand from
code comments. I prefer a static threshold because it's more simple,
which as Amit said in the following too:

https://www.postgresql.org/message-id/CAA4eK1%2B-p_K_j%3DNiGGD6tCYXiJH0ypT4REX5PBKJ4AcUoF3gZQ%40mail.gmail.com

> In the existing code, similar operations on wal_sender_timeout use the style of
> (wal_sender_timeout / 2), e.g. function WalSndKeepaliveIfNecessary. So I think
> it should be consistent in this patch.
> But I think it is better to use magic number too. Maybe we could improve it in
> a new thread.

I confirmed the code and +1 yours. We should treat it in another thread if needed.

BTW, this patch cannot be applied to current master.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-02-24 08:27:03 Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Previous Message Masahiko Sawada 2022-02-24 07:50:19 Re: Optionally automatically disable logical replication subscriptions on error