BUG #16174: PostgreSQL 9.6 service can not start after unexpected shutdown

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: pg(dot)dba(dot)iit(dot)team(at)gmail(dot)com
Subject: BUG #16174: PostgreSQL 9.6 service can not start after unexpected shutdown
Date: 2019-12-20 12:37:20
Message-ID: 16174-8bd292aa647c8f3c@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16174
Logged by: Enrico La Torre
Email address: pg(dot)dba(dot)iit(dot)team(at)gmail(dot)com
PostgreSQL version: 9.6.12
Operating system: Windows Server 2016 Standard 10.0.14393
Description:

Hi,

in my company we run several thousands of PostgreSQL clusters currently with
version 9.6.12 (one cluster per server). We run PostgreSQL on a virtual
machine in Hyper-V Manager.
A minor upgrade to the latest minor version is planned but I did not find
any fix for this incident in any minor upgrade from 9.6.12 to 9.6.16.

The same incident has been reported in
https://www.postgresql.org/message-id/CAJtpbKR2NZKmoxveMSdJh%3Dzng1%2B6NwqK3EEbwQs0ac6oHE%3D4eg%40mail.gmail.com
for PostgresSQL 11 but it has not been responded to yet. However, we
experience the same incident in with PostgreSQL 9.6.

We can only correlate this incident to a reboot after an unexpected shutdown
of Windows. During restart it can(!) happen that the PostgreSQL service can
not be started by the
Service Control Manager because the PostgreSQL cluster is already running in
the background. A pg_ctl status will return a running database server. This
behavior must be related
to a race condition, because the incident does not occur every time or can
not be reproduced easily.

The unexpected shutdowns originate from either power outages or more likely
due to restarts of the server that hosts the Hyper-V Manager. If this server
gets patched and nees to restart it basically kills all running VMs in that
process, if they don't shutdown fast enough. Of course you can argue that we
must get rid of these unexpected shutdowns. But isn't the boot
process always the same after a shutdown nor matter if it was shutdown
cleanly or not? Why is the PostgreSQL cluster started in the background and
not by the Service Control Manager?
Is there a bug in the interplay between pg_ctl and the Windows Service
Control Manager?

At the bottom I attached an excerpt of the relevant log files.

Kind regards

PostgreSQL version number you are running:
9.6.12

How you installed PostgreSQL:
We initally started with PostgreSQL 9.6.6 and the EnterpriseDB installer
postgresql-9.6.6-2-windows-x64.exe
The minor upgrade to 9.6.12 was an installation with EnterpriseDB installer
postgresql-9.6.12-1-windows-x64.exe and then simply a distribution of all
9.6.12 binary files to all other servers.

Changes made to the settings in the postgresql.conf file:

version
--------------------------------------------------------------
PostgreSQL 9.6.12, compiled by Visual C++ build 1800, 64-bit

name | current_setting
| source
------------------------------+-------------------------------------------------+----------------------
application_name | psql
| client
archive_command | copy %p I:\Backup\PostgreSQL-WAL-Archive\%f
| configuration file
archive_mode | on
| configuration file
archive_timeout | 1h
| configuration file
checkpoint_completion_target | 0.9
| configuration file
client_encoding | WIN1252
| client
cluster_name | xxxxxxxxxxxx
| configuration file
DateStyle | ISO, DMY
| configuration file
default_text_search_config | simple
| configuration file
dynamic_shared_memory_type | windows
| configuration file
effective_cache_size | 3GB
| configuration file
lc_messages | C
| configuration file
lc_monetary | C
| configuration file
lc_numeric | C
| configuration file
lc_time | C
| configuration file
listen_addresses | *
| configuration file
log_checkpoints | on
| configuration file
log_connections | on
| configuration file
log_destination | stderr
| configuration file
log_disconnections | on
| configuration file
log_hostname | on
| configuration file
log_line_prefix | %t [%p]: [%l]
%quser=%u,db=%d,app=%a,client=%h | configuration file
log_temp_files | 100kB
| configuration file
log_timezone | America/Chicago
| configuration file
logging_collector | on
| configuration file
maintenance_work_mem | 256MB
| configuration file
max_connections | 100
| configuration file
max_stack_depth | 2MB
| environment variable
max_wal_senders | 10
| configuration file
max_wal_size | 4GB
| configuration file
min_wal_size | 2GB
| configuration file
port | 5432
| configuration file
search_path | "$user", public
| database
shared_buffers | 512MB
| configuration file
ssl | on
| configuration file
ssl_cert_file | ssl-cert-snakeoil.pem
| configuration file
ssl_key_file | ssl-cert-snakeoil.key
| configuration file
TimeZone | America/Chicago
| configuration file
track_activity_query_size | 8192
| configuration file
wal_level | replica
| configuration file
wal_sync_method | fsync
| configuration file
work_mem | 61166kB
| configuration file

Operating system and version:
Windows Server 2016 Standard 10.0.14393

What program you're using to connect to PostgreSQL:
psql, pgadmin4, ODBC driver PostgreSQL Unicode(x64) 9.06.04.00

Is there anything relevant or unusual in the PostgreSQL server logs?:

in pg_log

2019-12-19 03:35:06 CST [4756]: [1] LOG: database system was interrupted;
last known up at 2019-12-19 03:21:56 CST
2019-12-19 03:35:06 CST [3924]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: connection received: host=446-091STL01.aldi-499.loc port=62037
2019-12-19 03:35:06 CST [3788]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: connection received: host=446-091STL01.aldi-499.loc port=62042
2019-12-19 03:35:06 CST [5844]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: connection received: host=446-091STL01.aldi-499.loc port=62044
2019-12-19 03:35:07 CST [3780]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: connection received: host=446-091STL01.aldi-499.loc port=62043
2019-12-19 03:35:07 CST [3924]: [2]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: could not accept SSL connection: An established connection was aborted
by the software in your host machine.
2019-12-19 03:35:07 CST [3788]: [2]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: could not accept SSL connection: An established connection was aborted
by the software in your host machine.
2019-12-19 03:35:07 CST [3780]: [2]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: could not accept SSL connection: An established connection was aborted
by the software in your host machine.
2019-12-19 03:35:08 CST [5844]: [2]
user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc
FATAL: the database system is starting up
2019-12-19 03:35:09 CST [4144]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: connection received: host=446-091STL01.aldi-499.loc port=62046
2019-12-19 03:35:09 CST [4144]: [2]
user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc
FATAL: the database system is starting up
2019-12-19 03:35:10 CST [4548]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG: connection received: host=446-091STL01.aldi-499.loc port=62048
[...] These messages continue until

2019-12-19 03:36:05 CST [5236]: [2]
user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc
FATAL: the database system is starting up
2019-12-19 03:37:29 CST [4756]: [2] LOG: database system was not properly
shut down; automatic recovery in progress
2019-12-19 03:37:29 CST [4756]: [3] LOG: redo starts at 26/23141D00
2019-12-19 03:37:29 CST [4756]: [4] LOG: invalid record length at
26/23151158: wanted 24, got 0
2019-12-19 03:37:29 CST [4756]: [5] LOG: redo done at 26/23151120
2019-12-19 03:37:30 CST [4756]: [6] LOG: last completed transaction was at
log time 2019-12-19 03:23:00.369744-06
2019-12-19 03:37:30 CST [4756]: [7] LOG: checkpoint starting:
end-of-recovery immediate
2019-12-19 03:37:30 CST [4756]: [8] LOG: checkpoint complete: wrote 10
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=0.006 s, sync=0.414 s, total=0.657 s; sync files=5, longest=0.098 s,
average=0.082 s; distance=61 kB, estimate=61 kB
2019-12-19 03:37:30 CST [4756]: [9] LOG: MultiXact member wraparound
protections are now enabled
2019-12-19 03:37:30 CST [5380]: [3] LOG: database system is ready to accept
connections

in Event Viewer / Application

Level Date and Time Source Event ID Task Category
Information 12/19/2019 3:34:23 AM PostgreSQL 0 None Waiting for server
startup...
Information 12/19/2019 3:35:05 AM PostgreSQL 0 None "2019-12-19 03:35:05 CST
[5380]: [1] LOG: redirecting log output to logging collector
process2019-12-19 03:35:05 CST [5380]: [2] HINT: Future log output will
appear in directory ""pg_log""."
Error 12/19/2019 3:36:06 AM PostgreSQL 0 None Timed out waiting for server
startup
Information 12/19/2019 3:39:29 AM PostgreSQL 0 None Waiting for server
startup...
Error 12/19/2019 3:39:29 AM PostgreSQL 0 None "2019-12-19 03:39:29 CST
[5216]: [1] FATAL: lock file ""postmaster.pid"" already exists2019-12-19
03:39:29 CST [5216]: [2] HINT: Is another postmaster (PID 5380) running in
data directory ""G:/databases/PostgreSQL/9.6""?"
Error 12/19/2019 3:39:30 AM PostgreSQL 0 None Timed out waiting for server
startup

in Event Viewer / System

Level Date and Time Source Event ID Task Category
Information 12/19/2019 3:31:58
AM Microsoft-Windows-Kernel-General 12 (1) The operating system started at
system time ‎2019‎-‎12‎-‎19T09:31:58.494227200Z.
Information 12/19/2019 3:31:58 AM Microsoft-Windows-Kernel-Boot 20 (31) The
last shutdown's success status was false. The last boot's success status was
true.
Critical 12/19/2019 3:32:08 AM Microsoft-Windows-Kernel-Power 41 (63) The
system has rebooted without cleanly shutting down first. This error could be
caused if the system stopped responding, crashed, or lost power
unexpectedly.
Error 12/19/2019 3:39:30 AM Service Control Manager 7001 None "The
xxx-Server service depends on the postgresql - PostgreSQL Server 9.6 service
which failed to start because of the following error: The operation
completed successfully."
Information 12/19/2019 3:39:30 AM Service Control Manager 7036 None The
postgresql - PostgreSQL Server 9.6 service entered the stopped state.

What you were doing when the error happened / how to cause the error:
with PowerShell:

# stop the cluster
pg_ctl stop -D G:\databases\PostgreSQL\9.6 -m fast
# start the cluster via Service Control Manager
Start-Service -Name postgresql

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2019-12-20 13:13:25 BUG #16175: To migrate MySQL To PostgreSQL
Previous Message PG Bug reporting form 2019-12-20 10:41:32 BUG #16173: PostgreSQL 9.6 service can not start after unexpected shutdown