Re: WAL accumulating, Logical Replication pg 13

From: Willy-Bas Loos <willybas(at)gmail(dot)com>
To: Tomas Pospisek <tpo2(at)sourcepole(dot)ch>
Cc: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: WAL accumulating, Logical Replication pg 13
Date: 2021-07-15 10:42:59
Message-ID: CAHnozTgPxZoPV-3sCnnUusK9dgzzJjAL6xAfMUBbKSqXa-onxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi, here's an update on this old thread.
I've found the problem, it was because of long transactions.
I reproduced the problem when I added a second publication and another
postgres cluster with a subscription to it (on acceptance still).

This query came in handy to show me what was happening:

select r.pid, a.state as query_state, a.wait_event, r.application_name,
r.state as replication_state,
-- now()-a.query_start as query_time,
r.write_lag,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.write_lsn)) as write_lag2,
r.flush_lag,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.flush_lsn)) as flush_lag2,
r.replay_lag,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.replay_lsn)) as replay_lag2,
r.client_addr, a.backend_start, a.state_change, a.query_start,
now()
from pg_stat_replication r
left join pg_stat_activity a on a.pid = r.pid
order by r.application_name;

What i saw was that replication wold "start", then enter a "catchup" state
and detect about 5GB of data, wich it would then process, so i would see
the "lag_..." numbers decline as i repeated that query.
When the number hit 0 or a slightly negative number, it would take a minute
(literally) and then the process was repeated. BTW I was seeing negative
numbers at some point so i swapped the values (e.g. r.write_lsn,
r.sent_lsn) and changed them back later. So I think that the 5GB was
actually -5GB, strangely. During the minute wait at a near-zero lag, the
query state would be "ReorderBufferWrite".

In the log there was an important extra line that I hadn't noticed before,
because it doesn't occur as often as all the lines about the workers
exiting in an error.
2021-07-14 14:04:58.110 CEST [22782] ERROR: terminating logical replication
worker due to timeout 2021-07-14 14:04:58.112 CEST [3720] LOG: background
worker "logical replication worker" (PID 22782) exited with exit code 1

I was already suspecting long transactions to be the problem, so I thought
that the timeout might be the problem.
At that point I changed the wal_receiver_timeout option to 0 in the
postgresql.conf file and the above behaviour ceased.
Instead I saw the lag values increase rapidly, with about 100GB per hour
(not a high traffic database).
There were long transactions on the subscriber database, but merely killing
those precesses wasn't enough because they would soon be started again.
I had to disable the job that started them for about 15 minutes, that did
the job. After that, those long transactions were not a problem anymore.

A guess at what is causing this:
The long transactions on the subscriber node are only a problem during the
initial snapshot phase. It uses a transaction and needs heavy locks on the
tables that are to be replicated.
The initial snapshot was cancelled because it didn't end successfully
within the wal_receiver_timeout, so it was restarted later.
When i disabled the timeout, all the WAL on te server was retained. There
were still periodical retries, there was no pending request for a lock. Or
it would have gotten one the moment I killed that long transaction.
But every time that the snapshot process was retried, there was already a
new long transaction preventing the lock for the snapshot.
I had to prevent long transactions for long enough so that the initial
snapshot could get a lock on the subscriber. However, I don't know the
length of the timeout that defines how long a transaction can be without
disturbing the snapshot.
Please correct me where I'm wrong.

Cheers,

Willy Loos

On Fri, Jun 11, 2021 at 9:51 AM Willy-Bas Loos <willybas(at)gmail(dot)com> wrote:

> Hi, I was going to follow up on this one, sorry for the long silence.
> The replication is working fine now, and I have no idea what the problem
> was. Not cool.
> If I find out, I will let you know.
>
> On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek <tpo2(at)sourcepole(dot)ch> wrote:
>
>> Hi Willy-Bas Loos,
>>
>> On 31.05.21 17:32, Willy-Bas Loos wrote:
>> >
>> >
>> > On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
>> > <vijaykumarjain(dot)github(at)gmail(dot)com
>> > <mailto:vijaykumarjain(dot)github(at)gmail(dot)com>> wrote:
>> >
>> > So I got it all wrong it seems :)
>> >
>> > Thank you for taking the time to help me!
>> >
>> > You upgraded to pg13 fine? , but while on pg13 you have issues with
>> > logical replication ?
>> >
>> > Yes, the upgrade went fine. So here are some details:
>> > I already had londiste running on postgres 9.3, but londiste wouldn't
>> > run on Debian 10
>> > So i first made the new server Debian 9 with postgres 9.6 and i started
>> > replicating with londiste from 9.3 to 9.6
>> > When all was ready, i stopped the replication to the 9.6 server and
>> > deleted all londiste & pgq content with drop schema cascade.
>> > Then I upgraded the server to Debian 10. Then i user pg_upgrade to
>> > upgrade from postgres 9.6 to 13. (PostGIS versions were kept
>> compatible).
>> > Then I added logical replication and a third server as a subscriber.
>> >
>> > I was going to write that replication is working fine (since the table
>> > contains a lot of data and there are no conflicts in the log), but it
>> > turns out that it isn't.
>> > The subscriber is behind and It looks like there hasn't been any
>> > incoming data after the initial data synchronization.
>> > So at least now i know that the WAL is being retained with a reason.
>> The
>> > connection is working properly (via psql anyway)
>>
>> I once maybe had a similar problem due to some ports that were needed
>> for replication being firewalled off or respectively the master having
>> the wrong IP address of the old master (now standby server) or such.
>>
>> There was absolutely no word anywhere in any log about the problem I was
>> just seeing the new postgres master not starting up after hours and
>> hours of waiting after a failover. I somehow found out about the
>> required port being blocked (I don't remember - maybe seing the
>> unanswered SYNs in tcpdump? Or via ufw log entries?).
>>
>> > I will also look into how to diagnose this from the system tables, e.g.
>> > substracting LSN's to get some quantitative measure for the lag.
>> >
>> >
>> >
>> > There is a path in the postgresql source user subscription folder
>> > iirc which covers various logical replication scenarios.
>> > That may help you just in case.
>> >
>> > OK, so comments in the source code you mean?
>> >
>>
>>
>
> --
> Willy-Bas Loos
>

--
Willy-Bas Loos

In response to

Browse pgsql-general by date

  From Date Subject
Next Message hubert depesz lubaczewski 2021-07-15 12:15:09 Re: looping over multirange segments?
Previous Message Lucas 2021-07-15 10:34:38 Re: PostgreSQL 9.2 high replication lag