Re: received immediate shutdown request caused cluster failover

From: Yi Sun <yinan81(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: received immediate shutdown request caused cluster failover
Date: 2020-11-20 01:20:00
Message-ID: CABWY_HCqpj26Svco_AnveXXJorbHcyCok2QQP4arpc2-RyJN2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi guys,

Besides command run(like pg_ctl) can cause "received immediate shutdown
request" any other reason can cause this please?

This production DB, support colleague said didn't run it

Yi Sun <yinan81(at)gmail(dot)com> 于2020年11月18日周三 上午11:54写道:

> Hi all
>
> There are 3 nodes in our prd db in patroni cluster, vm01 is leader, vm02
> and vm03 are standby, vm01 received immediate shutdown request caused
> failover to vm02, after that vm03 received fast shutdown request
>
> As vm03 not in cluster so have to reinit vm03
>
> What's the possible root caused vm01 received immediate shutdown request
> please?
>
> and What's the possible root caused vm03 received fast shutdown request
>
> Below are detail info, please check, if need any other info, I will
> provide, thanks
>
> postgresql version: 11.4
> OS: centos 7
> HA: patroni upgraded from 1.6.3 to 2.0.1 these days
>
> --vm01 pg log
>
> [2020-11-16 12:03:56.592 UTC] p=31485:3@ c=
> pgwatch2(at)127(dot)0(dot)0(dot)1/eu4_baas_bckp_epmgr:pgwatch2 LOG: disconnection:
> session time: 0:00:00.037 user=pgwatch2 database=eu4_baas_bckp_epmgr
> host=127.0.0.1 port=36916
> [2020-11-16 12:03:57.005 UTC] p=32103:14@ c=@/: LOG: received immediate
> shutdown request
> [2020-11-16 12:03:57.017 UTC] p=31189:3(at)314/0 c=
> atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:[unknown]
> <http://atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:%5Bunknown%5D>
> WARNING: terminating connection because of crash of another server process
> [2020-11-16 12:03:57.017 UTC] p=31189:4(at)314/0 c=
> atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:[unknown]
> <http://atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:%5Bunknown%5D>
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> [2020-11-16 12:03:57.017 UTC] p=31189:5(at)314/0 c=
> atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:[unknown]
> <http://atp_si_user(at)10(dot)253(dot)85(dot)203/eu4_baas_software_inventory:%5Bunknown%5D>
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> [2020-11-16 12:03:57.017 UTC] p=31183:3(at)313/0 c=
> atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:[unknown]
> <http://atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:%5Bunknown%5D> WARNING:
> terminating connection because of crash of another server process
> [2020-11-16 12:03:57.017 UTC] p=31183:4(at)313/0 c=
> atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:[unknown]
> <http://atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:%5Bunknown%5D> DETAIL: The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process exited abnormally and
> possibly corrupted shared memory.
> [2020-11-16 12:03:57.017 UTC] p=31183:5(at)313/0 c=
> atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:[unknown]
> <http://atp_pm(at)10(dot)253(dot)86(dot)62/eu4_baas_atp_pm:%5Bunknown%5D> HINT: In a
> moment you should be able to reconnect to the database and repeat your
> command.
> [2020-11-16 12:03:57.017 UTC] p=31182:3(at)310/281059 c=
> bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:[unknown]
> <http://bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:%5Bunknown%5D>
> WARNING: terminating connection because of crash of another server process
> [2020-11-16 12:03:57.017 UTC] p=31182:4(at)310/281059 c=
> bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:[unknown]
> <http://bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:%5Bunknown%5D>
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> [2020-11-16 12:03:57.017 UTC] p=31182:5(at)310/281059 c=
> bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:[unknown]
> <http://bckp_epmgr(at)10(dot)253(dot)85(dot)202/eu4_baas_bckp_epmgr:%5Bunknown%5D> HINT:
> In a moment you should be able to reconnect to the database and repeat
> your command.
>
> --vm01 patroni log
> Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:56,922", "name": "patroni.dcs.consul", "process": 32085,
> "thread": 140350381303616, "level": "WARNING", "message": "Could not
> register service: unknown role type promoted"}
> Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:56,923", "name": "patroni.ha", "process": 32085,
> "thread": 140350381303616, "level": "INFO", "message": "Lock owner:
> eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm01"}
> Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:56,923", "name": "patroni.ha", "process": 32085,
> "thread": 140350381303616, "level": "INFO", "message": "does not have lock"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,158", "name": "patroni.dcs.consul", "process": 32085,
> "thread": 140350381303616, "level": "INFO", "message": "Deregister service
> pgcluster11/eu4-baas-patroni-cluster-vm01"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,180", "name": "patroni", "process": 32085, "thread":
> 140350381303616, "level": "INFO", "message": "demoting self because i do
> not have the lock and i was a leader"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,181", "name": "patroni", "process": 32085, "thread":
> 140350381303616, "level": "WARNING", "message": "Loop time exceeded,
> rescheduling immediately."}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,184", "name": "patroni.ha", "process": 32085,
> "thread": 140350381303616, "level": "INFO", "message": "Lock owner:
> eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm01"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,185", "name": "patroni", "process": 32085, "thread":
> 140350381303616, "level": "INFO", "message": "starting after demotion in
> progress"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,186", "name": "patroni.postgresql.connection",
> "process": 32085, "thread": 140349984724736, "level": "INFO", "message":
> "closed patroni connection to the postgresql cluster"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time":
> "2020-11-16 12:03:59,668", "name": "patroni.postgresql.postmaster",
> "process": 32085, "thread": 140349984724736, "level": "INFO", "message":
> "postmaster pid=31495"}
> Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]:
> localhost:5432 - no response
>
> --vm03 pg log
> [2020-11-16 12:03:53.484 UTC] p=24090:1@
> c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
> port=39442
> [2020-11-16 12:03:53.485 UTC] p=24090:2(at)9/1878954 c=replicator@::1/[unknown]:[unknown]
> LOG: replication connection authorized: user=replicator
> [2020-11-16 12:03:53.487 UTC] p=24090:3@ c=replicator@::1/[unknown]:[unknown]
> LOG: disconnection: session time: 0:00:00.002 user=replicator database=
> host=::1 port=39442
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> [2020-11-16 12:03:57.108 UTC] p=27711:6@ c=@/: FATAL: could not receive
> data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> [2020-11-16 12:03:57.144 UTC] p=27701:7(at)1/0 c=@/: LOG: invalid record
> length at E052/93EED170: wanted 24, got 0
> [2020-11-16 12:03:57.419 UTC] p=24097:1@ c=@/: FATAL: could not connect
> to the primary server: FATAL: the database system is shutting down
>
> [2020-11-16 12:05:38.581 UTC] p=25380:2@ c=postgres@::1/postgres:[unknown]
> FATAL: the database system is starting up
> [2020-11-16 12:05:41.010 UTC] p=24113:4(at)1/0 c=@/: LOG: consistent
> recovery state reached at E052/93EED170
> [2020-11-16 12:05:41.010 UTC] p=24113:5(at)1/0 c=@/: LOG: invalid record
> length at E052/93EED170: wanted 24, got 0
> [2020-11-16 12:05:41.011 UTC] p=24109:8@ c=@/: LOG: database system is
> ready to accept read only connections
> [2020-11-16 12:05:43.280 UTC] p=25400:1@
> c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
> port=40922
> [2020-11-16 12:05:43.392 UTC] p=25401:1@
> c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
> port=40924
> [2020-11-16 12:05:43.393 UTC] p=25401:2(at)2/2 c=postgres@::1/postgres:[unknown]
> LOG: connection authorized: user=postgres database=postgres
> [2020-11-16 12:05:43.580 UTC] p=25402:1@
> c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
> port=40926
> [2020-11-16 12:05:43.628 UTC] p=25402:2(at)3/1 c=replicator@::1/[unknown]:[unknown]
> LOG: replication connection authorized: user=replicator
> [2020-11-16 12:05:43.630 UTC] p=25402:3@ c=replicator@::1/[unknown]:[unknown]
> LOG: disconnection: session time: 0:00:00.049 user=replicator database=
> host=::1 port=40926
> [2020-11-16 12:05:43.647 UTC] p=25406:1@
> c=[unknown]@::1/[unknown]:[unknown] LOG: connection received: host=::1
> port=40932
> [2020-11-16 12:05:43.648 UTC] p=25406:2(at)3/2 c=replicator@::1/[unknown]:[unknown]
> LOG: replication connection authorized: user=replicator
> [2020-11-16 12:05:43.650 UTC] p=25406:3@ c=replicator@::1/[unknown]:[unknown]
> LOG: disconnection: session time: 0:00:00.002 user=replicator database=
> host=::1 port=40932
> [2020-11-16 12:05:44.006 UTC] p=24109:9@ c=@/: LOG: received fast
> shutdown request
> [2020-11-16 12:05:44.015 UTC] p=24109:10@ c=@/: LOG: aborting any active
> transactions
> [2020-11-16 12:05:44.015 UTC] p=25401:3(at)2/0 c=postgres@::1/postgres:Patroni
> FATAL: terminating connection due to administrator command
> [2020-11-16 12:05:44.015 UTC] p=25401:4@ c=postgres@::1/postgres:Patroni
> LOG: disconnection: session time: 0:00:00.622 user=postgres
> database=postgres host=::1 port=40924
> [2020-11-16 12:05:44.509 UTC] p=24120:1@ c=@/: LOG: shutting down
> [2020-11-16 12:05:45.621 UTC] p=24109:11@ c=@/: LOG: database system is
> shut down
>
> --vm03 patroni log
> Nov 16 12:03:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:03:51,436", "name": "patroni", "process": 27664, "thread":
> 140371489281856, "level": "INFO", "message": "following a different leader
> because i am not the healthiest node"}
>
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
> pg_controldata: could not read file
> "/var/lib/pgsql/11/data/global/pg_control": read 0 of 288
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,016", "name": "patroni.postgresql", "process": 27664,
> "thread": 140371489281856, "level": "ERROR", "message": "Error when calling
> pg_controldata"}
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: Traceback
> (most recent call last):
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File
> "/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py",
> line 718, in controldata
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: data =
> subprocess.check_output([self.pgcommand('pg_controldata'), self._data_dir],
> env=env)
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File
> "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
> **kwargs).stdout
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File
> "/usr/lib64/python3.6/subprocess.py", line 438, in run
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
> output=stdout, stderr=stderr)
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]:
> subprocess.CalledProcessError: Command
> '['/usr/pgsql-11/bin/pg_controldata', '/var/lib/pgsql/11/data']' returned
> non-zero exit status 1.
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,018", "name": "patroni.ha", "process": 27664,
> "thread": 140371489281856, "level": "INFO", "message": "pg_controldata:
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: "}
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,022", "name": "patroni.ha", "process": 27664,
> "thread": 140371489281856, "level": "INFO", "message": "Lock owner:
> eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm03"}
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,022", "name": "patroni.ha", "process": 27664,
> "thread": 140371489281856, "level": "INFO", "message": "Lock owner:
> eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm03"}
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,031", "name": "patroni.dcs.consul", "process": 27664,
> "thread": 140371489281856, "level": "INFO", "message": "Deregister service
> pgcluster11/eu4-baas-patroni-cluster-vm03"}
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,048", "name": "patroni", "process": 27664, "thread":
> 140371489281856, "level": "INFO", "message": "starting as a secondary"}
> Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time":
> "2020-11-16 12:05:51,049", "name": "patroni.postgresql.connection",
> "process": 27664, "thread": 140371151345408, "level": "INFO", "message":
> "closed patroni connection to the postgresql cluster"}
>
>
> Best regards
> Oliver Sun
>
>
>
>
>
>
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2020-11-20 06:17:58 Re: received immediate shutdown request caused cluster failover
Previous Message Steve Baldwin 2020-11-19 23:45:02 Locking and postgres_fdw extension