Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2021-09-21 08:22:32
Message-ID: CAA5-nLDhnKRwsPF8z5+ej4D9P9WTfVuMeVsigxuO=+sjf0ruyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

If I understand, the instruction to send keep alive by the wal sender has
not been reached in the for loop, for what reason?
...
* Check for replication timeout. */
WalSndCheckTimeOut();

/* Send keepalive if the time has come */
WalSndKeepaliveIfNecessary();
...

The data load is performed on a table which is not replicated, I do not
understand why the whole transaction linked to an insert is copied to snap
files given that table does not take part of the logical replication.
We are going to do a test by modifying parameters
wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that
these parameters are global and changing them will also impact the physical
replication.

Concerning the walsender timeout, when the worker is started again after a
timeout, it will trigger a new walsender associated with it.

postgres 55680 12546 0 Sep20 ? 00:00:02 postgres: aq: bgworker:
logical replication worker for subscription 24651602
postgres 55681 12546 0 Sep20 ? 00:00:00 postgres: aq: wal sender
process repuser 127.0.0.1(57930) idle

Kind Regards

Fabrice

On Tue, Sep 21, 2021 at 8:38 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
> wrote:
> >
> > By passing the autovacuum parameter to off the problem did not occur
> right after loading the table as in our previous tests. However, the
> timeout occurred later. We have seen the accumulation of .snap files for
> several Gb.
> >
> > ...
> > -rw-------. 1 postgres postgres 16791226 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F5000000.snap
> > -rw-------. 1 postgres postgres 16973268 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F6000000.snap
> > -rw-------. 1 postgres postgres 16790984 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F7000000.snap
> > -rw-------. 1 postgres postgres 16988112 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F8000000.snap
> > -rw-------. 1 postgres postgres 16864593 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F9000000.snap
> > -rw-------. 1 postgres postgres 16902167 Sep 20 15:26
> xid-1238444701-lsn-2D2B-FA000000.snap
> > -rw-------. 1 postgres postgres 16914638 Sep 20 15:26
> xid-1238444701-lsn-2D2B-FB000000.snap
> > -rw-------. 1 postgres postgres 16782471 Sep 20 15:26
> xid-1238444701-lsn-2D2B-FC000000.snap
> > -rw-------. 1 postgres postgres 16963667 Sep 20 15:27
> xid-1238444701-lsn-2D2B-FD000000.snap
> > ...
> >
>
> Okay, still not sure why the publisher is not sending keep_alive
> messages in between spilling such a big transaction. If you see, we
> have logic in WalSndLoop() wherein each time after sending data we
> check whether we need to send a keep-alive message via function
> WalSndKeepaliveIfNecessary(). I think to debug this problem further
> you need to add some logs in function WalSndKeepaliveIfNecessary() to
> see why it is not sending keep_alive messages when all these files are
> being created.
>
> Did you change the default value of
> wal_sender_timeout/wal_receiver_timeout? What is the value of those
> variables in your environment? Did you see the message "terminating
> walsender process due to replication timeout" in your server logs?
>
> --
> With Regards,
> Amit Kapila.
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-09-21 09:04:40 Re: row filtering for logical replication
Previous Message Dmitry Dolgov 2021-09-21 08:07:55 Re: POC: Cleaning up orphaned files using undo logs