Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2021-09-21 15:41:50
Message-ID: CAA5-nLCs+XwLma7KPo_GJTKnhXU2YcTD9igMC8uDJV76xk72zQ@mail.gmail.com
Lists: pgsql-hackers

> IIUC, these are called after processing each WAL record so not
> sure how is it possible in your case that these are not reached?

I don't know. As you say, to pinpoint the problem we would have to debug
the WalSndKeepaliveIfNecessary function.
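
For reference while debugging, here is a minimal standalone sketch of the
timing rule these two functions are understood to apply: a keepalive is
requested once half of wal_sender_timeout has passed since the last reply,
and the connection is dropped once the full timeout has passed. This is a
simplified model, not the walsender.c source; SenderModel and the model_*
names are hypothetical, and time is plain milliseconds rather than
TimestampTz.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the walsender's timing state (milliseconds). */
typedef struct {
    int64_t last_reply_ms;    /* time of the last reply from the subscriber */
    int64_t timeout_ms;       /* models wal_sender_timeout; <= 0 disables it */
    bool    waiting_for_ping; /* keepalive already sent, no reply yet */
} SenderModel;

/* True once a keepalive should be requested: half the timeout has elapsed. */
bool model_keepalive_due(const SenderModel *s, int64_t now_ms)
{
    if (s->timeout_ms <= 0 || s->last_reply_ms <= 0 || s->waiting_for_ping)
        return false;
    return now_ms >= s->last_reply_ms + s->timeout_ms / 2;
}

/* True once the connection would be terminated: the full timeout has elapsed. */
bool model_timed_out(const SenderModel *s, int64_t now_ms)
{
    if (s->timeout_ms <= 0 || s->last_reply_ms <= 0)
        return false;
    return now_ms >= s->last_reply_ms + s->timeout_ms;
}
```

Under this model, a 1 minute timeout requests a keepalive after 30 seconds
without a reply and gives up after 60 seconds; a 5 minute timeout moves
those points to 150 and 300 seconds. Both checks only run when the loop
actually reaches them, so a long stretch of work between two calls can pass
the deadline without a keepalive having been sent.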

> I was curious to know if the walsender has exited before walreceiver

During the last tests we ran, we did not observe any timeout of the
walsender process.

> Do you mean you are planning to change from 1 minute to 5 minutes?

We set wal_sender_timeout/wal_receiver_timeout to 5 minutes and launched a
new test. The result is surprising and rather positive: there is no longer
any timeout in the log, and the 20 GB of snap files are removed in less
than 5 minutes. How can this behaviour be explained? Why are the snap files
suddenly consumed so quickly?
I chose the values for the wal_sender_timeout/wal_receiver_timeout
parameters arbitrarily. Are these values appropriate from your point of view?

Best Regards

Fabrice

On Tue, Sep 21, 2021 at 11:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
> wrote:
> >
> > If I understand, the instruction to send keep alive by the wal sender
> > has not been reached in the for loop, for what reason?
> > ...
> > /* Check for replication timeout. */
> > WalSndCheckTimeOut();
> >
> > /* Send keepalive if the time has come */
> > WalSndKeepaliveIfNecessary();
> > ...
> >
>
> Are you sure that these functions have not been called? Or the case is
> that these are called but due to some reason the keep-alive is not
> sent? IIUC, these are called after processing each WAL record so not
> sure how is it possible in your case that these are not reached?
>
> > The data load is performed on a table which is not replicated, I do not
> > understand why the whole transaction linked to an insert is copied to snap
> > files given that table does not take part of the logical replication.
> >
>
> It is because we don't know till the end of the transaction (where we
> start sending the data) whether the table will be replicated or not. I
> think specifically for this purpose the new 'streaming' feature
> introduced in PG-14 will help us to avoid writing data of such tables
> to snap/spill files. See 'streaming' option in Create Subscription
> docs [1].
>
> > We are going to do a test by modifying parameters
> > wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that
> > these parameters are global and changing them will also impact the physical
> > replication.
> >
>
> Do you mean you are planning to change from 1 minute to 5 minutes? I
> agree with the global nature of parameters and I think your approach
> to finding out the root cause is good here because otherwise, under
> some similar or more heavy workload, it might lead to the same
> situation.
>
> > Concerning the walsender timeout, when the worker is started again after
> > a timeout, it will trigger a new walsender associated with it.
> >
>
> Right, I know that but I was curious to know if the walsender has
> exited before walreceiver.
>
> [1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html
>
> --
> With Regards,
> Amit Kapila.
>
