Loss of replication after simple misconfiguration

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Loss of replication after simple misconfiguration
Date: 2020-04-09 14:08:36
Message-ID: 20200409140836.GA5990@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

I was trying to get it to replicate, but I couldn't in controlled
environment, so I have no repeatable proof of concept, but we've seen
the problem at least 4 times.

Configuration:

PostgreSQL 9.5.15 on Ubuntu bionic.
Streaming replication from master to slave, working without problem.
max_worker_processes set to 8 on both servers.

Preparation:
At some point in time, to accommodate upcoming (yes, really) upgrade to
pg12, we pushed change that write "max_worker_processes = 50" to all
configs, but didn't restart Pgs.

Fail scenario:
Master died (underlying AWS EC2 instance got hosed, was restarted).
Master got back up, but loaded new configuration from conf file.

Immediately it crashed slave with message:
2020-04-09 15:55:45.653 CEST 1304738 LOG: started streaming WAL from primary at 3/D000000 on timeline 1
2020-04-09 15:55:45.679 CEST 505710 FATAL: hot standby is not possible because max_worker_processes = 8 is a lower setting than on the master server (its value was 50)
2020-04-09 15:55:45.679 CEST 505710 CONTEXT: xlog redo XLOG/PARAMETER_CHANGE: max_connections=100 max_worker_processes=50 max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off track_commit_timestamp=off
(this message is from my test environment, but the idea is clear).

Given that we have max_worker_processes = 50 in config on slave, we
tried to restart only to be greeted by:
2020-04-07T15:13:49.729943+00:00 postgres[20491]: [7-1] db=,user= LOG: restored log file "000000030001779200000061" from archive
2020-04-07T15:13:49.757222+00:00 postgres[20491]: [8-1] db=,user= FATAL: could not access status of transaction 4275781146
2020-04-07T15:13:49.757314+00:00 postgres[20491]: [8-2] db=,user= DETAIL: Could not read from file "pg_commit_ts/27D4B" at offset 245760: Success.
2020-04-07T15:13:49.757380+00:00 postgres[20491]: [8-3] db=,user= CONTEXT: xlog redo Transaction/COMMIT: 2020-04-07 02:40:10.065859+00
2020-04-07T15:13:49.761239+00:00 postgres[20487]: [2-1] db=,user= LOG: startup process (PID 20491) exited with exit code 1
2020-04-07T15:13:49.761387+00:00 postgres[20487]: [3-1] db=,user= LOG: terminating any other active server processes

We had similar case (could not access status of transaction) on at least
4 separate pg installations, so it looks to be related to the
restart of master while bad max_worker_processes are set.

We have since rolled back the change to max_worker_processes, and will
be doing it in much saner/safer way, but the fact that slave can't start
after such master restart is troubling.

Unfortunately we don't have any instances broken in this way anymore,
and we don't have logs either - we had to rebuild the Pg setup
(master/slave) ASAP, and thus the evidence is lost.

Is there anything we could have done to fix it once it happened?

Is it a known problem?

Or maybe the max_worker_processes mismatch is redherring and the problem
was simply because master died while there was heavy write traffic with
lots of transactions?

Best regards,

depesz

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Eisentraut 2020-04-09 14:48:29 Re: BUG #16342: CREATE TABLE LIKE INCLUDING GENERATED column order issue
Previous Message Tom Lane 2020-04-09 13:40:00 Re: BUG #16351: PostgreSQL closing connection during requests with segmentation fault