Re: Loss of replication after simple misconfiguration

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
Cc: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Loss of replication after simple misconfiguration
Date: 2020-04-09 16:19:09
Message-ID: 878sj4skmj.fsf@news-spur.riddles.org.uk
Lists: pgsql-bugs

>>>>> "hubert" == hubert depesz lubaczewski <depesz(at)depesz(dot)com> writes:

hubert> PostgreSQL 9.5.15 on Ubuntu bionic.
[...]
hubert> tried to restart only to be greeted by:
hubert> 2020-04-07T15:13:49.729943+00:00 postgres[20491]: [7-1] db=,user= LOG: restored log file "000000030001779200000061" from archive
hubert> 2020-04-07T15:13:49.757222+00:00 postgres[20491]: [8-1] db=,user= FATAL: could not access status of transaction 4275781146
hubert> 2020-04-07T15:13:49.757314+00:00 postgres[20491]: [8-2] db=,user= DETAIL: Could not read from file "pg_commit_ts/27D4B" at offset 245760: Success.
hubert> 2020-04-07T15:13:49.757380+00:00 postgres[20491]: [8-3] db=,user= CONTEXT: xlog redo Transaction/COMMIT: 2020-04-07 02:40:10.065859+00
hubert> 2020-04-07T15:13:49.761239+00:00 postgres[20487]: [2-1] db=,user= LOG: startup process (PID 20491) exited with exit code 1
hubert> 2020-04-07T15:13:49.761387+00:00 postgres[20487]: [3-1] db=,user= LOG: terminating any other active server processes

So I've been assisting hubert with analysis of this on IRC, and what we
have found so far suggests:

1. The max_worker_processes change is a red herring.

2. It is virtually certain that the restart, in addition to changing
max_worker_processes on the master, also changed the master's setting of
track_commit_timestamp from off to on (which is clearly relevant to the
issue)

(We established #2 from the fact that we _do_ have the WAL files from
the failed recovery, and they don't contain any COMMIT_TS_ZEROPAGE
records despite covering many thousands of transactions.)
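
As a sanity check on the error message, the file name and offset in the
DETAIL line can be derived from the failing xid. The following is only a
sketch, and it rests on assumptions not stated in this thread: the default
8kB block size, a 10-byte commit timestamp entry (8-byte timestamp plus
2-byte origin id), and 32 pages per SLRU segment. The function name is
made up for illustration.

```python
# Sketch: map a transaction id to its pg_commit_ts segment file and offset.
# All three constants are assumptions about a default 9.5 build.
BLCKSZ = 8192                 # assumed default block size
ENTRY_SIZE = 10               # assumed: 8-byte timestamp + 2-byte origin id
XACTS_PER_PAGE = BLCKSZ // ENTRY_SIZE   # 819 entries per page
PAGES_PER_SEGMENT = 32        # assumed SLRU segment size in pages


def commit_ts_location(xid):
    """Return (segment file name, byte offset of page, entry index in page)."""
    page = xid // XACTS_PER_PAGE
    entry = xid % XACTS_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    offset = (page % PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % segment, offset, entry


if __name__ == "__main__":
    # xid taken from the FATAL line quoted above
    print(commit_ts_location(4275781146))
```

Under those assumptions, xid 4275781146 maps to segment 27D4B at offset
245760, which matches the log. It also comes out as entry 0 of its page,
i.e. the xid sits exactly on a page boundary, where a new commit_ts page
would have to be initialized before it could be read.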

I've suggested trying to reproduce the issue by changing
track_commit_timestamp on the master across a crash.

I did notice that 9.5.15 does have a fix for an issue in this area, but
I didn't see any more recent changes - did I miss anything?

--
Andrew (irc:RhodiumToad)
