walsender timeout on logical replication set

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: walsender timeout on logical replication set
Date: 2021-09-13 01:31:07
Message-ID: 20210913.103107.813489310351696839.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hello.

As reported in [1], it seems that walsender can hit a timeout in
certain cases. This is not clearly confirmed, but I suspect there is
a case where LogicalRepApplyLoop keeps running its innermost loop
without receiving a keepalive packet for longer than
wal_sender_timeout (not wal_receiver_timeout). Of course, if the
subscriber is simply underpowered, that can be resolved by giving it
sufficient processing power. But if this happens between servers with
equal processing power, it is reasonable to "fix" it. Theoretically, I
think this can happen with equally powered servers if the connecting
network is sufficiently fast, because sending reordered changes is
relatively simpler and faster than applying the changes on the
subscriber.

I think we don't want to call GetCurrentTimestamp on every iteration
of the innermost loop. Even if we call it every N iterations, I can't
come up with a proper N that fits every workload. So one possible
solution would be to use SIGALRM. Is it worth doing? Or is there any
other way?

Even if we don't fix this, we might need to add a description of
this restriction to the documentation.

Any thoughts?

[1] https://www.postgresql.org/message-id/CAEDsCzhBtkNDLM46_fo_HirFYE2Mb3ucbZrYqG59ocWqWy7-xA%40mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
