received immediate shutdown request caused cluster failover

From: Yi Sun <yinan81(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: received immediate shutdown request caused cluster failover
Date: 2020-11-18 03:54:15
Message-ID: CABWY_HBb3iRxYdLD2zEeGv4RMKBMP1m=3Yz8qFcKcvm93thScg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all

There are 3 nodes in our prd db in patroni cluster, vm01 is leader, vm02
and vm03 are standby, vm01 received immediate shutdown request caused
failover to vm02, after that vm03 received fast shutdown request

As vm03 not in cluster so have to reinit vm03

What's the possible root caused vm01 received immediate shutdown request
please?

and What's the possible root caused vm03 received fast shutdown request

Below are detail info, please check, if need any other info, I will
provide, thanks

postgresql version: 11.4
OS: centos 7
HA: patroni upgraded from 1.6.3 to 2.0.1 these days

--vm01 pg log

[2020-11-16 12:03:56.592 UTC] p=31485:3@ c=
pgwatch2(at)127(dot)0(dot)0(dot)1/eu4_baas_bckp_epmgr:pgwatch2 LOG: disconnection:
session time: 0:00:00.037 user=pgwatch2 database=eu4_baas_bckp_epmgr
host=127.0.0.1 port=36916
[2020-11-16 12:03:57.005 UTC] p=32103:14@ c=@/: LOG: received immediate
shutdown request
[2020-11-16 12:03:57.017 UTC] p=31189:3(at)314/0 c=
atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:[unknown] WARNING:
terminating connection because of crash of another server process
[2020-11-16 12:03:57.017 UTC] p=31189:4(at)314/0 c=
atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:[unknown] DETAIL:
The postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
[2020-11-16 12:03:57.017 UTC] p=31189:5(at)314/0 c=
atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:[unknown] HINT: In a
moment you should be able to reconnect to the database and repeat your
command.
[2020-11-16 12:03:57.017 UTC] p=31183:3(at)313/0 c=
atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:[unknown] WARNING: terminating
connection because of crash of another server process
[2020-11-16 12:03:57.017 UTC] p=31183:4(at)313/0 c=
atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:[unknown] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
[2020-11-16 12:03:57.017 UTC] p=31183:5(at)313/0 c=
atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:[unknown] HINT: In a moment you should
be able to reconnect to the database and repeat your command.
[2020-11-16 12:03:57.017 UTC] p=31182:3(at)310/281059 c=
bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:[unknown] WARNING:
terminating connection because of crash of another server process
[2020-11-16 12:03:57.017 UTC] p=31182:4(at)310/281059 c=
bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:[unknown] DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
[2020-11-16 12:03:57.017 UTC] p=31182:5(at)310/281059 c=
bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:[unknown] HINT: In a moment
you should be able to reconnect to the database and repeat your command.

--vm01 patroni log
Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:56,922", "name": "patroni.dcs.consul", "process": 32085,
"thread": 140350381303616, "level": "WARNING", "message": "Could not
register service: unknown role type promoted"}
Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:56,923", "name": "patroni.ha", "process": 32085,
"thread": 140350381303616, "level": "INFO", "message": "Lock owner:
eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm01"}
Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:56,923", "name": "patroni.ha", "process": 32085,
"thread": 140350381303616, "level": "INFO", "message": "does not have lock"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,158", "name": "patroni.dcs.consul", "process": 32085,
"thread": 140350381303616, "level": "INFO", "message": "Deregister service
pgcluster11/eu4-baas-patroni-cluster-vm01"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,180", "name": "patroni", "process": 32085, "thread":
140350381303616, "level": "INFO", "message": "demoting self because i do
not have the lock and i was a leader"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,181", "name": "patroni", "process": 32085, "thread":
140350381303616, "level": "WARNING", "message": "Loop time exceeded,
rescheduling immediately."}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,184", "name": "patroni.ha", "process": 32085,
"thread": 140350381303616, "level": "INFO", "message": "Lock owner:
eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm01"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,185", "name": "patroni", "process": 32085, "thread":
140350381303616, "level": "INFO", "message": "starting after demotion in
progress"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,186", "name": "patroni.postgresql.connection",
"process": 32085, "thread": 140349984724736, "level": "INFO", "message":
"closed patroni connection to the postgresql cluster"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
"2020-11-16 12:03:59,668", "name": "patroni.postgresql.postmaster",
"process": 32085, "thread": 140349984724736, "level": "INFO", "message":
"postmaster pid=31495"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]:
localhost:5432 - no response

--vm03 pg log
[2020-11-16 12:03:53.484 UTC] p=24090:1@
c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
port=39442
[2020-11-16 12:03:53.485 UTC] p=24090:2(at)9/1878954
c=replicator@::1/[unknown]:[unknown]
LOG: replication connection authorized: user=replicator
[2020-11-16 12:03:53.487 UTC] p=24090:3@ c=replicator@::1/[unknown]:[unknown]
LOG: disconnection: session time: 0:00:00.002 user=replicator database=
host=::1 port=39442
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
[2020-11-16 12:03:57.108 UTC] p=27711:6@ c=@/: FATAL: could not receive
data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[2020-11-16 12:03:57.144 UTC] p=27701:7(at)1/0 c=@/: LOG: invalid record
length at E052/93EED170: wanted 24, got 0
[2020-11-16 12:03:57.419 UTC] p=24097:1@ c=@/: FATAL: could not connect to
the primary server: FATAL: the database system is shutting down

[2020-11-16 12:05:38.581 UTC] p=25380:2@ c=postgres@::1/postgres:[unknown]
FATAL: the database system is starting up
[2020-11-16 12:05:41.010 UTC] p=24113:4(at)1/0 c=@/: LOG: consistent recovery
state reached at E052/93EED170
[2020-11-16 12:05:41.010 UTC] p=24113:5(at)1/0 c=@/: LOG: invalid record
length at E052/93EED170: wanted 24, got 0
[2020-11-16 12:05:41.011 UTC] p=24109:8@ c=@/: LOG: database system is
ready to accept read only connections
[2020-11-16 12:05:43.280 UTC] p=25400:1@
c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
port=40922
[2020-11-16 12:05:43.392 UTC] p=25401:1@
c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
port=40924
[2020-11-16 12:05:43.393 UTC] p=25401:2(at)2/2 c=postgres@::1/postgres:[unknown]
LOG: connection authorized: user=postgres database=postgres
[2020-11-16 12:05:43.580 UTC] p=25402:1@
c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
port=40926
[2020-11-16 12:05:43.628 UTC] p=25402:2(at)3/1
c=replicator@::1/[unknown]:[unknown]
LOG: replication connection authorized: user=replicator
[2020-11-16 12:05:43.630 UTC] p=25402:3@ c=replicator@::1/[unknown]:[unknown]
LOG: disconnection: session time: 0:00:00.049 user=replicator database=
host=::1 port=40926
[2020-11-16 12:05:43.647 UTC] p=25406:1@
c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
port=40932
[2020-11-16 12:05:43.648 UTC] p=25406:2(at)3/2
c=replicator@::1/[unknown]:[unknown]
LOG: replication connection authorized: user=replicator
[2020-11-16 12:05:43.650 UTC] p=25406:3@ c=replicator@::1/[unknown]:[unknown]
LOG: disconnection: session time: 0:00:00.002 user=replicator database=
host=::1 port=40932
[2020-11-16 12:05:44.006 UTC] p=24109:9@ c=@/: LOG: received fast shutdown
request
[2020-11-16 12:05:44.015 UTC] p=24109:10@ c=@/: LOG: aborting any active
transactions
[2020-11-16 12:05:44.015 UTC] p=25401:3(at)2/0 c=postgres@::1/postgres:Patroni
FATAL: terminating connection due to administrator command
[2020-11-16 12:05:44.015 UTC] p=25401:4@ c=postgres@::1/postgres:Patroni
LOG: disconnection: session time: 0:00:00.622 user=postgres
database=postgres host=::1 port=40924
[2020-11-16 12:05:44.509 UTC] p=24120:1@ c=@/: LOG: shutting down
[2020-11-16 12:05:45.621 UTC] p=24109:11@ c=@/: LOG: database system is
shut down

--vm03 patroni log
Nov 16 12:03:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:03:51,436", "name": "patroni", "process": 27664, "thread":
140371489281856, "level": "INFO", "message": "following a different leader
because i am not the healthiest node"}

Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
pg_controldata: could not read file
"/var/lib/pgsql/11/data/global/pg_control": read 0 of 288
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,016", "name": "patroni.postgresql", "process": 27664,
"thread": 140371489281856, "level": "ERROR", "message": "Error when calling
pg_controldata"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: Traceback
(most recent call last):
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File
"/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py",
line 718, in controldata
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: data =
subprocess.check_output([self.pgcommand('pg_controldata'), self._data_dir],
env=env)
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File
"/usr/lib64/python3.6/subprocess.py", line 356, in check_output
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
**kwargs).stdout
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File
"/usr/lib64/python3.6/subprocess.py", line 438, in run
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
output=stdout, stderr=stderr)
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
subprocess.CalledProcessError: Command
'['/usr/pgsql-11/bin/pg_controldata', '/var/lib/pgsql/11/data']' returned
non-zero exit status 1.
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,018", "name": "patroni.ha", "process": 27664,
"thread": 140371489281856, "level": "INFO", "message": "pg_controldata:
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: "}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,022", "name": "patroni.ha", "process": 27664,
"thread": 140371489281856, "level": "INFO", "message": "Lock owner:
eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm03"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,022", "name": "patroni.ha", "process": 27664,
"thread": 140371489281856, "level": "INFO", "message": "Lock owner:
eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm03"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,031", "name": "patroni.dcs.consul", "process": 27664,
"thread": 140371489281856, "level": "INFO", "message": "Deregister service
pgcluster11/eu4-baas-patroni-cluster-vm03"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,048", "name": "patroni", "process": 27664, "thread":
140371489281856, "level": "INFO", "message": "starting as a secondary"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
"2020-11-16 12:05:51,049", "name": "patroni.postgresql.connection",
"process": 27664, "thread": 140371151345408, "level": "INFO", "message":
"closed patroni connection to the postgresql cluster"}

Best regards
Oliver Sun

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Marcin Giedz 2020-11-18 07:10:52 Re: pg_upgrade from 12 to 13 failes with plpython2
Previous Message David G. Johnston 2020-11-18 03:37:28 Re: create type with %type or %rowtype