BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)

From: matioli(dot)matheus(at)gmail(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Cc: matioli(dot)matheus(at)gmail(dot)com
Subject: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)
Date: 2017-09-20 09:26:09
Message-ID: 20170920092609.17363.35370@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14820
Logged by: Matheus de Oliveira
Email address: matioli(dot)matheus(at)gmail(dot)com
PostgreSQL version: 9.6.4
Operating system: Linux CentOS 7
Description:

Hi dear hackers,

I have a situation that I _think_ may be a bug, although I couldn't pin down
the exact cause.

I have 3 servers, 1 primary and 2 standbys (using streaming replication and
hot standby), all running version 9.6.4. On the primary, the following
parameters were adjusted:

wal_level = 'logical'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
shared_preload_libraries = 'pglogical'
track_commit_timestamp = on

After we changed `max_worker_processes`, both standbys crashed (which is
expected), with these messages:

2017-09-20 03:30:51.387 -03 [84754]: [2-1] LOG: replication terminated
by primary server
2017-09-20 03:30:51.387 -03 [84754]: [3-1] DETAIL: End of WAL reached
on timeline 2 at 23AF/68000098.
2017-09-20 03:30:51.388 -03 [84754]: [4-1] FATAL: could not send
end-of-streaming message to primary: no COPY in progress
2017-09-20 03:30:51.391 -03 [84739]: [6-1] LOG: invalid resource
manager ID 55 at 23AF/68000098
2017-09-20 03:30:54.525 -03 [50377]: [1-1] FATAL: could not connect to
the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-09-20 03:30:56.402 -03 [50378]: [1-1] LOG: started streaming WAL
from primary at 23AF/68000000 on timeline 2
2017-09-20 03:30:56.411 -03 [84739]: [7-1] FATAL: hot standby is not
possible because max_worker_processes = 8 is a lower setting than on the
master server (its value was 10)
2017-09-20 03:30:56.411 -03 [84739]: [8-1] CONTEXT: xlog redo at
23AF/68000098 for XLOG/PARAMETER_CHANGE: max_connections=3000
max_worker_processes=10 max_prepared_xacts=0 max_locks_per_xact=64
wal_level=logical wal_log_hints=off track_commit_timestamp=on
2017-09-20 03:30:57.261 -03 [84736]: [6-1] LOG: startup process (PID
84739) exited with exit code 1
2017-09-20 03:30:57.261 -03 [84736]: [7-1] LOG: terminating any other
active server processes
2017-09-20 03:30:57.261 -03 [50376]: [3-1] WARNING: terminating
connection because of crash of another server process
2017-09-20 03:30:57.261 -03 [50376]: [4-1] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
2017-09-20 03:30:57.261 -03 [50376]: [5-1] HINT: In a moment you should
be able to reconnect to the database and repeat your command.
... => many of the above (for all connections)

2017-09-20 03:31:00.509 -03 [84736]: [8-1] LOG: database system is shut
down
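
As the FATAL message above says, hot standby requires that settings such as
max_connections, max_worker_processes, max_prepared_transactions and
max_locks_per_transaction on the standby be at least as high as on the
primary. For completeness, this is the fragment we ended up adding to the
standbys' postgresql.conf (mirroring the primary's values shown earlier):

max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
track_commit_timestamp = on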

After that, we adjusted all the new parameters on the standby servers to
match the primary and tried to start them again, with no success:

2017-09-20 03:46:36.041 -03 [50892]: [1-1] LOG: database system was
interrupted while in recovery at log time 2017-09-20 03:17:15 -03
2017-09-20 03:46:36.041 -03 [50892]: [2-1] HINT: If this has occurred
more than once some data might be corrupted and you might need to choose an
earlier recovery target.
2017-09-20 03:46:36.598 -03 [50892]: [3-1] LOG: entering standby mode
2017-09-20 03:46:36.607 -03 [50892]: [4-1] LOG: redo starts at
23AF/47E40620
2017-09-20 03:46:36.621 -03 [50892]: [5-1] FATAL: could not access
status of transaction 1556437428
2017-09-20 03:46:36.621 -03 [50892]: [6-1] DETAIL: Could not read from
file "pg_commit_ts/E7FB" at offset 229376: Success.
2017-09-20 03:46:36.621 -03 [50892]: [7-1] CONTEXT: xlog redo at
23AF/481CE818 for Transaction/COMMIT: 2017-09-20 03:17:18.239579-03
2017-09-20 03:46:36.623 -03 [50889]: [3-1] LOG: startup process (PID
50892) exited with exit code 1
2017-09-20 03:46:36.623 -03 [50889]: [4-1] LOG: terminating any other
active server processes
2017-09-20 03:46:36.721 -03 [50889]: [5-1] LOG: database system is shut
down

Now any attempt to start either of the standby servers gives the above errors
(the transaction and offset are exactly the same). Do you think it may be a
bug? Note that I loaded `pglogical`, which I don't think is causing the
trouble, but it may be important information. It seems to be something
related to `track_commit_timestamp` or to using `wal_level = 'logical'`, but
I'm not sure how to debug it.

I have checked, and the file `pg_commit_ts/E7FB` in the data directory is
exactly 229376 bytes, so it looks like recovery is trying to read the file
past its end, hence the above error (note the "Success" errno in the DETAIL,
typical of a zero-byte read at EOF).
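
The arithmetic seems to back that up. Going by the 9.6 sources
(commit_ts.c / slru.c), each pg_commit_ts entry is 10 bytes (an 8-byte
TimestampTz plus a 2-byte RepOriginId), so an 8192-byte SLRU page holds 819
entries and a segment spans 32 pages. A quick sketch, with the layout
constants taken from those sources (the script itself is just illustrative):

```python
# Map an XID to its pg_commit_ts segment file and byte offset, using
# the SLRU layout constants from PostgreSQL 9.6 (commit_ts.c / slru.c).
BLCKSZ = 8192                           # SLRU page size
ENTRY_SIZE = 10                         # sizeof(CommitTimestampEntry)
XACTS_PER_PAGE = BLCKSZ // ENTRY_SIZE   # 819 entries per page
PAGES_PER_SEGMENT = 32                  # SLRU_PAGES_PER_SEGMENT

def commit_ts_location(xid):
    page = xid // XACTS_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    page_in_segment = page % PAGES_PER_SEGMENT
    # Segment file names are the segment number in uppercase hex.
    return "%04X" % segment, page_in_segment * BLCKSZ

print(commit_ts_location(1556437428))   # ('E7FB', 229376)
```

That lands exactly on the file and offset in the FATAL message: transaction
1556437428 lives on the 29th page of segment E7FB, i.e. starting at byte
229376, which is precisely where the file currently ends. So redo is asking
for a pg_commit_ts page that was never written out on the standby.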

I have resynchronized one of the standby servers, but left the other one in
this state so I can help with any further debugging if needed. The instance
is about 1.2 TB, so trial and error with dd or the like is not that simple.
