Re: Loss of replication after simple misconfiguration

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Victor Yegorov <vyegorov(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Loss of replication after simple misconfiguration
Date: 2020-04-10 07:26:51
Message-ID: 20200410072651.GA16098@depesz.com
Lists: pgsql-bugs

On Fri, Apr 10, 2020 at 01:14:34PM +0900, Michael Paquier wrote:
> Hmm. We have a gap in tests here as we don't have any tests stressing
> switchovers when it comes to track_commit_timestamps. Anyway, could
> you confirm that I got the problem right? Here is the flow I am getting
> from the information of upthread, roughly:
> 1) Primary/standby cluster, both using max_worker_processes = 8, and
> track_commit_timestamp = off.
> 2) In order to begin the switchover, first stop cleanly the primary.
> 3) Update configuration of the standby as follows, promote it and
> restart it:
> track_commit_timestamp = on
> max_worker_processes = 50
> 4) Enable streaming on the old primary to make it a standby, starting
> it fails because of the unmatching setting for max_worker_processes.
> 5) Re-adjust max_worker_processes correctly on the new standby, start
> it. Then this startup should fail at the lookup of pg_commit_ts/.

Well, no.

In our case it was *at least* this scenario:

1. master and slave both with max_worker_processes and
track_commit_timestamp off.
2. config files get changed on both to include track_commit_timestamp = on
3. slave gets restarted
4. config files get changed on both to include max_worker_processes = 50
5. master gets stopped by "power outage"
6. after master re-starts, replication to slave dies.
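
To make the sequence above easier to follow, here is a minimal Python sketch (not PostgreSQL code) of which values each node is actually running at each step. It relies on the fact that both settings only take effect at server start. The Node class and its methods are hypothetical names used for illustration, and the starting value max_worker_processes = 8 is an assumption borrowed from Michael's summary, not something confirmed for this cluster.

```python
from dataclasses import dataclass, field

# Both settings only take effect at server start, so a value edited in
# the config file stays pending until the next restart.
# ASSUMPTION: the starting value 8 comes from the quoted summary.
INITIAL = {"track_commit_timestamp": "off", "max_worker_processes": 8}


@dataclass
class Node:
    """Hypothetical model of one server: file values vs. running values."""
    name: str
    file: dict = field(default_factory=lambda: dict(INITIAL))
    running: dict = field(default_factory=lambda: dict(INITIAL))

    def edit_config(self, **settings):
        """Change the config file only; the running server is unaffected."""
        self.file.update(settings)

    def restart(self):
        """A restart (planned or after a crash) loads the file values."""
        self.running = dict(self.file)

    def pending(self):
        """Settings whose file value differs from the running value."""
        return {k: v for k, v in self.file.items() if self.running[k] != v}


def first_scenario():
    master, slave = Node("master"), Node("slave")
    for node in (master, slave):                      # step 2
        node.edit_config(track_commit_timestamp="on")
    slave.restart()                                   # step 3
    for node in (master, slave):                      # step 4
        node.edit_config(max_worker_processes=50)
    before_crash = (master.pending(), slave.pending())
    master.restart()                                  # steps 5 and 6
    return master, slave, before_crash


if __name__ == "__main__":
    master, slave, (m_pending, s_pending) = first_scenario()
    print("master pending before crash:", m_pending)
    print("slave pending before crash: ", s_pending)
    print("master running after restart:", master.running)
    print("slave running after restart: ", slave.running)
```

Under these assumptions, the master picks up both changes at once when it comes back, while the slave is still running with the old max_worker_processes. That lines up with step 6, but whether it alone explains the failure is not established here.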

But it could also have been a different scenario:

1. master and slave both with max_worker_processes and
track_commit_timestamp off.
2. config files get changed on both to include track_commit_timestamp = on
3. slave gets restarted (or maybe not, we can't be sure)
4. config files get changed on both to include max_worker_processes = 50
5. a set of 2 new slaves (slave2 and slave3) is set up off slave, both
with max_worker_processes = 50 and track_commit_timestamp = on
6. slave3 is modified to stream off slave2
7. master crash
8. after the restart, one of the slaves (maybe more?) lost its replication

Andrew suggested yesterday on IRC that it could be a timing issue, so
testing for it might be complicated - hence my inability to replicate
the problem in a test environment.

I will try to do the tests using extended scenarios with slave2 and
slave3, but I'm not overly optimistic about replicating this particular
case.

Best regards,

depesz
