BUG? Slave don't reconnect to the master

From: Олег Самойлов <splarv(at)ya(dot)ru>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: BUG? Slave don't reconnect to the master
Date: 2020-08-18 10:48:41
Message-ID: 60590EC6-4062-4F25-A49C-3948ED2A7D47@ya.ru
Lists: pgsql-general

Hi all.

I found some strange behaviour in PostgreSQL that I believe is a bug. First, let me explain the setup.

I created a test bed for testing high-availability clusters based on Pacemaker and PostgreSQL. The test bed consists of 12 virtual machines (on VirtualBox) running on a MacBook Pro, forming 4 HA clusters with different structures. All 4 HA clusters are tested constantly in a loop: a failure of some kind is simulated, the test waits for the failover, the failure is fixed, and so on. For simplicity I'll describe only one HA cluster. It has 3 virtual machines, with the master on one and a sync slave and an async slave on the others. The PostgreSQL service is provided through floating IPs that point to the working master and slaves. The slaves also connect to the master through its floating IP. When Pacemaker detects a failure, for instance on the master, it promotes a new master on the other node with the lowest WAL lag and switches the floating IPs, so the third node continues as a sync slave. My company has decided to release this project as open source; I am now finishing the formalities.
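To illustrate the promotion rule, here is a minimal sketch of picking the node with the lowest WAL lag. It assumes that "lowest lag" means "highest reported WAL position (LSN)", which may not be exactly how the Pacemaker resource agent decides. The function names, node names and LSN values are hypothetical; only the LSN text format (two hexadecimal numbers separated by a slash, forming a 64-bit position) is PostgreSQL's own.

```python
from typing import Dict


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/1600DDE8' into a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(lsn_by_node: Dict[str, str]) -> str:
    """Return the node whose reported LSN is furthest ahead."""
    return max(lsn_by_node, key=lambda node: parse_lsn(lsn_by_node[node]))


if __name__ == "__main__":
    # Hypothetical surviving nodes and positions, for illustration only.
    survivors = {"node2": "0/1A00DE10", "node3": "0/1600DDE8"}
    print(pick_promotion_candidate(survivors))
```

Comparing LSNs as integers rather than as strings matters, because the two halves are variable-width hexadecimal and do not sort correctly as text.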

It almost always works fine, but on rare occasions I see that a slave doesn't reconnect to the new master after a failure. The first case is the PostgreSQL-STOP test, where I use `kill` to send the STOP signal to postgres on the master to simulate a freeze. The slave doesn't reconnect to the new master, and its log shows these errors:

18:02:56.236 [3154] FATAL: terminating walreceiver due to timeout
18:02:56.237 [1421] LOG: record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10
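For readers who want to reproduce the freeze in the PostgreSQL-STOP test, it can be sketched roughly as below. This is not the test bed's actual code: the helper names `freeze` and `thaw` are hypothetical, and the caller is assumed to already know the PID of the process to suspend.

```python
import os
import signal


def freeze(pid: int) -> None:
    """Suspend a process, as `kill -STOP <pid>` would."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid: int) -> None:
    """Resume a suspended process, as `kill -CONT <pid>` would."""
    os.kill(pid, signal.SIGCONT)
```

A stopped process keeps its sockets open, so from the slave's side the connection looks alive but silent, and the slave only notices the problem through a timeout.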

What is strange is that the error about the incorrect WAL record is raised after the connection has been terminated. This can be worked around by turning off the WAL receiver timeout (wal_receiver_timeout). With that, PostgreSQL-STOP works fine, but the problem still exists in another test. ForkBomb simulates an out-of-memory situation. In this case too, a slave sometimes doesn't reconnect to the new master, with these errors in the log:

10:09:43.99 [1417] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
10:09:43.992 [1413] LOG: invalid record length at 0/D8014278: wanted 24, got 0

By the way, the last error message (the last line in the log) differed between occurrences.

What I expect as the correct behaviour: the PostgreSQL slave must reconnect to the master IP (the floating IP) after wal_retrieve_retry_interval.
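The expected behaviour can be sketched as a simple retry loop. This is only an illustration of what I expect, not PostgreSQL's actual walreceiver logic: `connect_with_retry`, its parameters and the `connect` callable are hypothetical, and the 5 second interval in the example is the documented default of wal_retrieve_retry_interval.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    retry_interval: float,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, waiting retry_interval between tries.

    With max_attempts=None the loop never gives up, which is the behaviour
    expected from a standby. The last ConnectionError is re-raised once
    max_attempts is reached.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Wait before trying the (floating) master address again.
            sleep(retry_interval)
```

For example, `connect_with_retry(open_replication_connection, retry_interval=5.0)` would keep trying every 5 seconds, where `open_replication_connection` stands for whatever function opens the connection to the floating IP.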
