Re: Logical replication timeout problem

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2021-09-21 09:52:29
Message-ID: CAA4eK1JYB2eLNTs48szQ8oJXW9GLcrxT0yWETQS0gF2sVBxMYA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com> wrote:
>
> If I understand correctly, the walsender never reached the instruction that sends the keepalive in the for loop. For what reason?
> ...
> /* Check for replication timeout. */
> WalSndCheckTimeOut();
>
> /* Send keepalive if the time has come */
> WalSndKeepaliveIfNecessary();
> ...
>

Are you sure that these functions have not been called? Or is it the
case that they are called but, for some reason, the keepalive is not
sent? IIUC, these are called after processing each WAL record, so I am
not sure how it is possible in your case that they are not reached.
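
To make the question concrete, here is a toy model in C of a loop that runs both checks after every record. It is only a sketch, not the walsender code: SenderState, check_timeout, keepalive_if_necessary and run_loop are hypothetical names, the clock is simulated, the receiver is assumed never to reply, and the "request a reply after half the timeout" rule is an assumption of the model.

```c
#include <stdbool.h>

/* Hypothetical stand-in for walsender state; not PostgreSQL's real struct. */
typedef struct
{
    long now_ms;              /* simulated clock */
    long last_reply_ms;       /* last time the receiver replied */
    long timeout_ms;          /* models wal_sender_timeout */
    bool keepalive_requested; /* a keepalive is already outstanding */
    int  keepalives_sent;     /* how many keepalives the model sent */
} SenderState;

/* True once the receiver has been silent for the whole timeout. */
static bool
check_timeout(const SenderState *s)
{
    return s->now_ms - s->last_reply_ms >= s->timeout_ms;
}

/* Assumption of this model: ask for a reply after half the timeout. */
static void
keepalive_if_necessary(SenderState *s)
{
    if (s->keepalive_requested)
        return;
    if (s->now_ms - s->last_reply_ms >= s->timeout_ms / 2)
    {
        s->keepalives_sent++;
        s->keepalive_requested = true;
    }
}

/*
 * Process nrecords records, each costing cost_ms of simulated time, and run
 * both checks after every record. Returns the index of the record at which
 * the timeout was detected, or -1 if the loop finished without a timeout.
 */
static int
run_loop(SenderState *s, int nrecords, long cost_ms)
{
    for (int i = 0; i < nrecords; i++)
    {
        s->now_ms += cost_ms;      /* "decode" one record */
        if (check_timeout(s))
            return i;
        keepalive_if_necessary(s);
    }
    return -1;
}
```

In this model, with a 60000 ms timeout and records costing 1000 ms each, one keepalive goes out before the timeout is detected. If a single step costs more than the whole timeout, the timeout is detected on the first check and no keepalive is ever sent, even though both functions were reached. Whether anything like that happens in your case is exactly what needs to be verified.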

> The data load is performed on a table which is not replicated. I do not understand why the whole transaction linked to an insert is copied to snap files, given that the table does not take part in the logical replication.
>

It is because we don't know until the end of the transaction (where we
start sending the data) whether the table will be replicated or not. I
think the new 'streaming' feature introduced in PG-14 will help
specifically with this, as it lets us avoid writing the data of such
tables to snap/spill files. See the 'streaming' option in the Create
Subscription docs [1].
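
As a rough illustration of the difference, the following C sketch models one transaction that touches a single relation. It is a deliberately simplified toy and not the reorder buffer implementation: TxnStats, is_published and decode_txn are hypothetical names, the memory limit is counted in changes rather than bytes, and only one transaction is considered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction counters for the toy model. */
typedef struct
{
    size_t in_memory; /* changes currently buffered in memory */
    size_t spilled;   /* changes written to spill files */
    size_t sent;      /* changes handed to the subscriber */
} TxnStats;

/* Hypothetical publication check: only relation 1 is published. */
static bool
is_published(int relid)
{
    return relid == 1;
}

/*
 * Decode nchanges changes against relation relid. When the buffer grows past
 * mem_limit, either spill it (streaming off) or filter and stream it right
 * away (streaming on). At commit, whatever is left is filtered and sent.
 */
static TxnStats
decode_txn(int relid, size_t nchanges, size_t mem_limit, bool streaming)
{
    TxnStats t = {0, 0, 0};

    for (size_t i = 0; i < nchanges; i++)
    {
        t.in_memory++;
        if (t.in_memory > mem_limit)
        {
            if (streaming)
            {
                /* In-progress streaming: the filter can run now. */
                if (is_published(relid))
                    t.sent += t.in_memory;
            }
            else
            {
                /* Not known yet whether this is replicated: spill. */
                t.spilled += t.in_memory;
            }
            t.in_memory = 0;
        }
    }

    /* Commit: replay buffered and spilled changes through the filter. */
    if (is_published(relid))
        t.sent += t.in_memory + t.spilled;
    t.in_memory = 0;
    return t;
}
```

In this model, a 1000-change load into an unpublished relation with a limit of 100 spills 909 changes when streaming is off, only for all of them to be discarded at commit. With streaming on, nothing is spilled and nothing is sent.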

> We are going to do a test by modifying parameters wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that these parameters are global and changing them will also impact the physical replication.
>

Do you mean you are planning to change from 1 minute to 5 minutes? I
agree that these parameters are global. I think your approach of
finding the root cause is the right one here, because otherwise a
similar or heavier workload might lead to the same situation.

> Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.
>

Right, I know that, but I was curious to know whether the walsender
had exited before the walreceiver.

[1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html

--
With Regards,
Amit Kapila.
