Re: Unintended restart after recovery error

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unintended restart after recovery error
Date: 2014-11-17 15:46:58
Message-ID: CA+TgmoYi7DwEP+EhaMW-sYfNLu2B0Bh-yz1PeWkNV2s7_0w8bA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 13, 2014 at 10:59 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> 442231d7f71764b8c628044e7ce2225f9aa43b6 introduced the latter rule
> for hot-standby case. Maybe *during crash recovery* (i.e., hot standby
> should not be enabled) it's better to treat the crash of startup process as
> a catastrophic crash.

Maybe, but why, specifically? If the startup process failed
internally, it's probably because it hit an error during the replay of
some WAL record. So if we restart it, it will back up to the previous
checkpoint or restartpoint, replay the same WAL records as before, and
die again in the same spot. We don't want it to sit there and do that
forever in an infinite loop, so it makes sense to kill the whole
server.

But if the startup process was killed off because the checkpointer
croaked, that logic doesn't necessarily apply. There's no reason to
assume that the replay of a particular WAL record was what killed the
checkpointer; in fact, it seems like the odds are against it. So it
seems right to fall back to our general principle of restarting the
server and hoping that's enough to get things back on line.
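
To make the distinction above concrete, here is a minimal sketch in Python.
It is only an illustration of the reasoning, not the postmaster's actual
code: the names Action, first_failing_record and decide_action, and the
"bad" marker for an unreplayable WAL record, are all hypothetical.

```python
from enum import Enum, auto


class Action(Enum):
    SHUT_DOWN = auto()     # stop the whole server and wait for an administrator
    REINITIALIZE = auto()  # restart the server and run recovery again


def first_failing_record(wal, start):
    """Replay wal[start:] in order and return the index of the first
    record whose replay fails, or None if replay reaches the end.

    "bad" is a hypothetical marker for a record that cannot be replayed.
    """
    for i in range(start, len(wal)):
        if wal[i] == "bad":
            return i
    return None


def decide_action(startup_failed_on_its_own):
    """Hypothetical policy following the argument above.

    If the startup process died on its own, replay is assumed to be
    deterministic, so a restart would fail at the same record again.
    If it was only killed because another process (for example the
    checkpointer) crashed, a restart has a reasonable chance to succeed.
    """
    if startup_failed_on_its_own:
        return Action.SHUT_DOWN
    return Action.REINITIALIZE
```

In this model, replaying ["ok", "ok", "bad", "ok"] from index 0 stops at
index 2 on every attempt, which is why restarting after an internal
startup failure would loop forever, while a startup process that was
merely killed alongside a crashed checkpointer has no such fixed point
of failure.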

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
