Re: walsender timeout on logical replication set

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: abhishek(dot)bhola(at)japannext(dot)co(dot)jp, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: walsender timeout on logical replication set
Date: 2021-09-17 10:16:48
Message-ID: CAA4eK1KK4bmTx6A6TGkcaESwzQfxgOvafdLtFfSoCsY6jwWfzg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 17, 2021 at 12:48 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> Thank you very much for coming in!
>
> At Fri, 17 Sep 2021 10:18:11 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> > On Mon, Sep 13, 2021 at 7:01 AM Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > >
> > > Hello.
> > >
> > > As reported in [1] it seems that walsender can suffer timeout in
> > > certain cases. It is not clearly confirmed, but I suspect that
> > > there's the case where LogicalRepApplyLoop keeps running the innermost
> > > loop without receiving keepalive packet for longer than
> > > wal_sender_timeout (not wal_receiver_timeout).
> >
> > Why is that happening? In the previous investigation in this area [1],
> > your tests revealed that after reading a WAL page, we always send a
> > keepalive, so even if the transaction is large, we should send some
> > keepalives in between.
>
> We fixed the too-many-keepalives issue (aka the keepalive flood) in that
> thread, but this is an issue of a long absence of subscriber responses.
> What I'm suspecting, or assuming, here is:
>
> - The publisher is working fine. It doesn't send many extra keepalives
> and does send regular keepalives every wal_sender_timeout/2 seconds by
> the sender's clock.
>

I think the publisher should also send one even after a certain amount
of WAL is consumed, via the below code:
WalSndWaitForWal()
{
...
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
        WalSndKeepalive(false);
...
}
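
For reference, the publisher-side timeout itself is enforced roughly as
below (a simplified paraphrase of WalSndCheckTimeOut() in walsender.c,
not the exact source):

/* Simplified sketch of the walsender timeout check */
static void
WalSndCheckTimeOut(void)
{
    TimestampTz timeout;

    /* nothing to do if timeouts do not apply in the current state */
    if (last_reply_timestamp <= 0)
        return;

    timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                          wal_sender_timeout);

    if (wal_sender_timeout > 0 && last_processing >= timeout)
    {
        /* no reply from the subscriber within wal_sender_timeout */
        ereport(COMMERROR,
                (errmsg("terminating walsender process due to replication timeout")));
        WalSndShutdown();
    }
}

So if the apply worker never gets around to sending a status message,
last_reply_timestamp never advances and the walsender exits even though
the connection itself is healthy.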

> - The network conveys all the data in time.
>
> - The subscriber consumes received data at less than half the speed at
> which the publisher sends it. In this case, while the burst of
> traffic is coming, the publisher keeps sending for
> wal_sender_timeout/2 seconds and may not send a keepalive for the
> same duration. This is the correct behavior. On the other hand, the
> subscriber is kept busy without receiving a keepalive for
> wal_sender_timeout seconds. AFAICS LogicalRepApplyLoop doesn't send
> a response unless a keepalive arrives while it is in the innermost loop.
>

One way this could happen is that the apply is taking a long time
because of contention on the subscriber, say a lot of other operations
are going on there, or it is stuck for some reason.

> If wal_sender_timeout is relatively short (5 seconds, in the report),
> a burst of (or gap-less) logical replication traffic can easily
> continue for more than 2.5 seconds. If wal_sender_timeout is longer (1
> min, ditto), bursts of replication traffic lasting more than
> wal_sender_timeout/2 become relatively infrequent.
>
> However, I'm not sure why increasing it further to 5 min makes things
> worse again.
>

There might be a possibility that the subscriber is stuck or is
extremely slow due to other operations.

> Is my diagnosis correct that while the innermost loop in
> LogicalRepApplyLoop [A] is busy, it doesn't have a chance to send a
> reply until a keepalive comes in? If so, a walsender timeout due to
> slowness of the subscriber can happen, and we might need to break out
> of the innermost loop to give the subscriber a chance to send a
> response at appropriate intervals. This is what I wanted to propose.
>

I was thinking increasing wal_sender/receiver_timeout should solve
this problem. I am not sure why it leads to loss of WAL in the OP's
case.
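
That said, if we do end up teaching the apply worker to report progress
while it is busy, I imagine the change to the innermost loop would look
something like the sketch below (untested; last_feedback_time is a
hypothetical variable, and the surrounding code is abbreviated):

/* inside the "process all available data" loop of LogicalRepApplyLoop() */
for (;;)
{
    ...
    apply_dispatch(&s);

    /*
     * Hypothetical addition: if we have been busy applying for more
     * than half of wal_receiver_timeout without reporting progress,
     * send a status update so the publisher's wal_sender_timeout
     * doesn't expire in the meantime.
     */
    if (TimestampDifferenceExceeds(last_feedback_time,
                                   GetCurrentTimestamp(),
                                   wal_receiver_timeout / 2))
    {
        send_feedback(last_received, false, false);
        last_feedback_time = GetCurrentTimestamp();
    }
    ...
}

Whether wal_receiver_timeout / 2 is the right interval (rather than,
say, wal_receiver_status_interval) would need discussion, since the
publisher's wal_sender_timeout is what actually matters here and the
subscriber cannot see it.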

> [A]
> backend/replication/logical/worker.c:2565 @ today's master
> > /* Loop to process all available data (without blocking). */
> > for (;;)
>
>
>
> > The other thing that I am not able to understand from Abhishek's reply
> > [2] is why increasing wal_sender_timeout/wal_receiver_timeout leads to
> > the removal of required WAL segments. As per my understanding, we
> > shouldn't remove WAL unless we get confirmation that the subscriber
> > has processed it.
> >
> > [1] - https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com
> > [2] - https://www.postgresql.org/message-id/CAEDsCzjEHLxgqa4d563CKFwSbgBvvnM91Cqfq_qoZDXCkyOsiw%40mail.gmail.com
> >
> > Note - I have added Abhishek to see if he has answers to any of these questions.
>
> Ouch! max_slot_wal_keep_size was introduced in 13. So, for now, I have
> no idea how the required segments could have been removed on the publisher.
>
> == From the first report [1]
> sourcedb=# select * from pg_replication_slots;
> ...
> restart_lsn 116D0/C36886F8
> confirmed_flush_lsn 116D0/C3E5D370
>
> targetdb=# show wal_receiver_status_interval;
> wal_receiver_status_interval 2s
>
> targetdb=# select * from pg_stat_subscription;
> ..
> received_lsn 116D1/2BA8F170
> last_msg_send_time 2021-08-20 09:05:15.398423+09
> last_msg_receipt_time 2021-08-20 09:05:15.398471+09
> latest_end_lsn 116D1/2BA8F170
> latest_end_time 2021-08-20 09:05:15.398423+09
> ==
>
> There is a gap of about 105 segments (1.7GB) between how far the
> subscriber has advanced and the publisher's idea of how far the
> subscriber has advanced. But that alone cannot cause WAL removal...
>

Yeah, that is quite strange.
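
For what it's worth, the reported numbers do add up. A quick standalone
check of the LSN gap, using the values copied from the report above
(throwaway code, not anything from the server):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* an LSN printed as "hi/lo" corresponds to ((uint64) hi << 32) | lo */
    uint64_t received_lsn    = ((uint64_t) 0x116D1 << 32) | 0x2BA8F170; /* subscriber's received_lsn */
    uint64_t confirmed_flush = ((uint64_t) 0x116D0 << 32) | 0xC3E5D370; /* publisher's confirmed_flush_lsn */
    uint64_t gap = received_lsn - confirmed_flush;

    printf("gap = %llu bytes (~%.1f GB, ~%.0f 16MB segments)\n",
           (unsigned long long) gap, gap / 1e9, gap / (16.0 * 1024 * 1024));
    return 0;
}

That prints a gap of roughly 1.7GB, on the order of a hundred 16MB
segments, which matches the estimate above; a lag like that by itself
shouldn't lead to removal of still-needed WAL.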

--
With Regards,
Amit Kapila.
