BUG #9420: WAL-Sender repeatedly killed

From: proels(at)web(dot)de
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #9420: WAL-Sender repeatedly killed
Date: 2014-03-03 12:53:42
Message-ID: 20140303125342.2159.79464@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 9420
Logged by: Stefan Pröls
Email address: proels(at)web(dot)de
PostgreSQL version: 9.2.7
Operating system: opensuse 11.4
Description:

I'm unsure whether this is really a bug, but at least the log messages don't
look like expected behavior.

We are running async. streaming replication between 2 PostgreSQL 9.2.7
servers. Last night the slave failed to reboot nicely because of a
filesystem inconsistency which required remote hands. So there was a delay
of about 30 hours until the slave was up an running again. Thus, it had to
catch up quite some replication delay. It did succeed doing so -- however,
according to the log messages the WAL sender process was killed several
times during this period. From the master's log:

2014-03-03 13:02:30 CET [unbekannt] standby LOG: breche WAL-Sender-Prozess
ab wegen Zeitüberschreitung bei der Replikation
2014-03-03 13:05:08 CET [unbekannt] standby LOG: breche WAL-Sender-Prozess
ab wegen Zeitüberschreitung bei der Replikation
[several more times]

>From the slave's log:

2014-03-03 12:47:18 CET FATAL: konnte keine Daten an den WAL-Stream
senden: Server beendete die Verbindung unerwartet
Das heißt wahrscheinlich, dass der Server abnormal beendete
bevor oder während die Anweisung bearbeitet wurde.

2014-03-03 13:00:34 CET LOG: ungültige magische Zahl 0000 in Logdatei 17,
Segment 111, Offset 3538944
2014-03-03 13:00:35 CET LOG: Streaming-Replikation hat erfolgreich mit
primärem Server verbunden
2014-03-03 13:03:07 CET FATAL: konnte keine Daten an den WAL-Stream
senden: Server beendete die Verbindung unerwartet
Das heißt wahrscheinlich, dass der Server abnormal beendete
bevor oder während die Anweisung bearbeitet wurde.

2014-03-03 13:04:04 CET LOG: ungültige magische Zahl 0000 in Logdatei 17,
Segment 185, Offset 5111808
2014-03-03 13:04:04 CET LOG: Streaming-Replikation hat erfolgreich mit
primärem Server verbunden
[also repeated serverali more times]

It appears the slave had trouble keeping up with the amount of data sent by
the master and, therefore, the wal-sender was repeatedly killed until the
slave finally catched up. I don't know where the "invalid magic number 0000"
comes from, though.

Note that the master has much faster disks (SSDs) than the slave
(SATA-HDDs), so some congestion on the master is likely to arise when the
master streams out more data than the slave can keep up with.

Browse pgsql-bugs by date

  From Date Subject
Next Message muhammadk 2014-03-03 13:52:15 BUG #9422: PostgreSQL Zero downtime during deployement
Previous Message Sandro Santilli 2014-03-03 08:31:09 Re: uninterruptable regexp_replace in 9.2 and 9.3