Re: Logical replication timeout problem

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2021-09-21 06:38:28
Message-ID: CAA4eK1LeN1V85i2ZfU2cOj5vJjpEVSvJ6LOfAu-u7mfQrW=v1Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com> wrote:
>
> After setting the autovacuum parameter to off, the problem did not occur right after loading the table, as it had in our previous tests. However, the timeout occurred later. We saw several GB of .snap files accumulate.
>
> ...
> -rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
> -rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
> -rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
> -rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
> -rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
> -rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
> -rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
> -rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
> -rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
> ...
>

Okay, I am still not sure why the publisher is not sending keepalive
messages while it is spilling such a big transaction. We have logic in
WalSndLoop() that, each time after sending data, checks whether a
keepalive message needs to be sent by calling
WalSndKeepaliveIfNecessary(). To debug this problem further, I think
you need to add some logging to WalSndKeepaliveIfNecessary() to see
why it is not sending keepalive messages while all these files are
being created.
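
To make it clearer which branches are worth logging, here is a rough
standalone model of the decision WalSndKeepaliveIfNecessary() makes, as
I read walsender.c. It is only a sketch: the enum, the function names,
and the millisecond integer timestamps are made up for illustration
(the server uses TimestampTz and the function returns void), and the
details may differ between versions.

```c
#include <stdint.h>

/* Hypothetical reason codes, one per branch you might want to log. */
typedef enum
{
    KA_SKIP_TIMEOUT_DISABLED,   /* timeout is 0, or no reply seen yet */
    KA_SKIP_WAITING_FOR_REPLY,  /* a ping is already outstanding */
    KA_SKIP_TOO_EARLY,          /* under half the timeout has elapsed */
    KA_SEND                     /* a keepalive would be sent now */
} KeepaliveDecision;

/*
 * Simplified model of the keepalive check. All times are in
 * milliseconds on an arbitrary clock (an assumption of this sketch).
 */
KeepaliveDecision
keepalive_decision(int64_t wal_sender_timeout_ms,
                   int64_t last_reply_ms,
                   int64_t last_processed_ms,
                   int waiting_for_ping_response)
{
    if (wal_sender_timeout_ms <= 0 || last_reply_ms <= 0)
        return KA_SKIP_TIMEOUT_DISABLED;

    if (waiting_for_ping_response)
        return KA_SKIP_WAITING_FOR_REPLY;

    /* Ping once half of the timeout has passed since the last reply. */
    if (last_processed_ms >= last_reply_ms + wal_sender_timeout_ms / 2)
        return KA_SEND;

    return KA_SKIP_TOO_EARLY;
}

/* Map a decision to a short string suitable for a debug log line. */
const char *
keepalive_decision_name(KeepaliveDecision d)
{
    switch (d)
    {
        case KA_SKIP_TIMEOUT_DISABLED:
            return "timeout disabled or no reply yet";
        case KA_SKIP_WAITING_FOR_REPLY:
            return "already waiting for ping response";
        case KA_SKIP_TOO_EARLY:
            return "too early";
        case KA_SEND:
            return "send keepalive";
    }
    return "unknown";
}
```

In the real function, an elog() at each early return, printing the
values being compared, should tell us which of these cases applies
while the .snap files are being written, or whether the function is
reached at all during that time.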

Did you change wal_sender_timeout or wal_receiver_timeout from their
default values? What are they set to in your environment? Did you see
the message "terminating walsender process due to replication timeout"
in your server logs?
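
For reference, both parameters default to 60 seconds. The message
quoted above is raised on the publisher side when no reply has arrived
within wal_sender_timeout. A rough standalone model of that check
follows; the function name and the millisecond integer timestamps are
made up for illustration and this is not the actual server code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the walsender timeout check: returns true when
 * the sender would give up and log the "replication timeout" message.
 * All times are in milliseconds on an arbitrary clock (an assumption
 * of this sketch).
 */
bool
replication_timed_out(int64_t wal_sender_timeout_ms,
                      int64_t last_reply_ms,
                      int64_t last_processed_ms)
{
    /* A timeout of zero disables the check; so does having no reply yet. */
    if (wal_sender_timeout_ms <= 0 || last_reply_ms <= 0)
        return false;

    return last_processed_ms >= last_reply_ms + wal_sender_timeout_ms;
}
```

With the default of 60000 ms, a reply last seen at t = 1000 ms leads
to a timeout once the sender's clock reaches t = 61000 ms, so knowing
your actual settings tells us how long the subscriber must have been
silent.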

--
With Regards,
Amit Kapila.
