BUG #15360: Cascade slave cannot continue replication without manual restart after master server crash/restart.

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: maxim(dot)boguk(at)gmail(dot)com
Subject: BUG #15360: Cascade slave cannot continue replication without manual restart after master server crash/restart.
Date: 2018-09-01 06:58:24
Message-ID: 153578510446.31898.8752275961171482070@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15360
Logged by: Maxim Boguk
Email address: maxim(dot)boguk(at)gmail(dot)com
PostgreSQL version: 9.6.10
Operating system: Linux Ubuntu
Description:

Hi,

I have the following cascading replication configuration (using only
streaming replication, without a WAL archive):

master db -> main standby replica db -> multiple read-only replicas.
All servers run PostgreSQL 9.6.10.
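
For context, each read-only replica is an ordinary 9.6 cascaded standby whose
recovery.conf points primary_conninfo at the main standby rather than at the
master. A minimal sketch of that setup (host name and user are placeholders,
not our actual settings):

    # recovery.conf on a read-only replica -- illustrative sketch only
    standby_mode = 'on'
    # connect to the main standby, not to the master
    primary_conninfo = 'host=main-standby.example port=5432 user=replica'
    recovery_target_timeline = 'latest'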

For a reason that is not yet known, the master server got stuck (the whole
server, not just PostgreSQL) and was manually restarted.
After that, a very strange thing happened:
the main standby replica reconnected to the restarted master successfully and
continued replication, but none of the read-only replicas connected to the
main standby resumed replication until they were manually restarted.
After a manual restart they continued replication without any further
complaints.
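
While a replica is stuck like this, the stall is visible from SQL on that
replica; something along these lines (9.6 function names, shown only as an
illustration) reports whether the walreceiver is running and whether the
received/replayed locations are still advancing:

    -- run on a stuck read-only replica (9.6 names; illustrative only)
    SELECT status, received_lsn FROM pg_stat_wal_receiver;
    SELECT pg_last_xlog_receive_location() AS receive_location,
           pg_last_xlog_replay_location()  AS replay_location,
           pg_last_xact_replay_timestamp() AS last_replay_time;

On a healthy standby these values keep moving; the log excerpts below show
what the stuck replicas reported instead.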

A read-only replica (i.e. a second-level cascaded standby) has the following
errors in its log:
2018-09-01 02:02:59 MSK 28822 @ from [vxid: txid:0] [] DETAIL: last
completed transaction was at log time 2018-09-01 02:02:59.501032+03
... different production selects logged...
2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid
contrecord length 2431 at D9ED/EAA2FB00
2018-09-01 03:36:55 MSK 30264 @ from [vxid: txid:0] [] FATAL: terminating
walreceiver process due to administrator command
2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid
contrecord length 2431 at D9ED/EAA2FB00
2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid
contrecord length 2431 at D9ED/EAA2FB00
2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid
contrecord length 2431 at D9ED/EAA2FB00
...
... and so on until a manual restart was issued
2018-09-01 04:05:12 MSK 28815 @ from [vxid: txid:0] [] LOG: received fast
shutdown request
2018-09-01 04:05:12 MSK 28815 @ from [vxid: txid:0] [] LOG: aborting any
active transactions
...
2018-09-01 04:05:12 MSK 28815 @ from [vxid: txid:0] [] LOG: database
system is shut down
2018-09-01 04:05:15 MSK 39686 @ from [vxid: txid:0] [] LOG: database
system was shut down in recovery at 2018-09-01 04:05:12 MSK
2018-09-01 04:05:15 MSK 39686 @ from [vxid: txid:0] [] LOG: entering
standby mode
2018-09-01 04:05:15 MSK 39686 @ from [vxid:1/0 txid:0] [] LOG: redo starts
at D9EB/18907C38
2018-09-01 04:07:25 MSK 39686 @ from [vxid:1/0 txid:0] [] LOG: consistent
recovery state reached at D9ED/E1FB7A60
2018-09-01 04:07:26 MSK 39686 @ from [vxid:1/0 txid:0] [] LOG: invalid
contrecord length 2431 at D9ED/EAA2FB00
2018-09-01 04:07:26 MSK 40123 @ from [vxid: txid:0] [] LOG: started
streaming WAL from primary at D9ED/EA000000 on timeline 1
... no errors and all ok after
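
For reference, the "fast shutdown request" above is just the manual restart of
the replica, roughly something like the following (the data directory path is
a placeholder; the Debian/Ubuntu pg_ctlcluster wrapper would work equally
well):

    # illustrative only -- data directory path is a placeholder
    pg_ctl -D /var/lib/postgresql/9.6/main -m fast restart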

At the same time, the main standby replica has the following lines in its log:
2018-09-01 02:02:59 MSK 2558 @ from [vxid: txid:0] [] DETAIL: last
completed transaction was at log time 2018-09-01 02:02:59.377119+03
...
2018-09-01 03:18:36 MSK 2971 @ from [vxid: txid:0] [] FATAL: terminating
walreceiver due to timeout
2018-09-01 03:28:31 MSK 4754 @ from [vxid: txid:0] [] FATAL: could not
connect to the primary server: could not connect to server: No route to
host
...
2018-09-01 03:35:10 MSK 6025 @ from [vxid: txid:0] [] FATAL: could not
connect to the primary server: FATAL: the database system is starting up
...
2018-09-01 03:36:55 MSK 6276 @ from [vxid: txid:0] [] LOG: started
streaming WAL from primary at D9ED/EA000000 on timeline 1
2018-09-01 03:36:55 MSK 2553 @ from [vxid:1/0 txid:0] [] LOG: invalid
contrecord length 2431 at D9ED/EAA2FB00
2018-09-01 03:36:55 MSK 6276 @ from [vxid: txid:0] [] FATAL: terminating
walreceiver process due to administrator command
2018-09-01 03:36:55 MSK 2553 @ from [vxid:1/0 txid:0] [] LOG: invalid
resource manager ID 101 at D9ED/EAA76618
2018-09-01 03:36:55 MSK 6277 @ from [vxid: txid:0] [] LOG: started
streaming WAL from primary at D9ED/EA000000 on timeline 1
... the read-only replicas are restarted at this point
2018-09-01 03:50:39 MSK 8548 replica(at)[unknown] from 172.16.18.13 [vxid:222/0
txid:0] [idle] LOG: received replication command: IDENTIFY_SYSTEM
2018-09-01 03:50:39 MSK 8548 replica(at)[unknown] from 172.16.18.13 [vxid:222/0
txid:0] [idle] LOG: received replication command: START_REPLICATION
D9ED/EA000000 TIMELINE 1
... no errors and all ok after

Question: is this behavior expected, or is something fishy going on here?
I have never seen such a situation with cascading streaming replication
before.

Kind Regards,
Maxim
