
Re: How to solve the problem of one backend process crashing and causing other processes to restart?

From:	Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, yuansong <yyuansong(at)126(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Date:	2023-11-13 05:53:29
Message-ID:	cd4089fc9b0901584193cbced59575a1737cdea4.camel@cybertec.at
Lists:	pgsql-hackers

On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
> yuansong <yyuansong(at)126(dot)com> writes:
> > In PostgreSQL, when a backend process crashes, it can cause other backend
> > processes to also require a restart, primarily to ensure data consistency.
> > I understand that the correct approach is to analyze and identify the
> > cause of the crash and resolve it. However, it is also important to be
> > able to handle a backend process crash without affecting the operation of
> > other processes, thus minimizing the scope of negative impact and
> > improving availability. To achieve this goal, could we mimic the Oracle
> > process by introducing a "pmon" process dedicated to rolling back crashed
> > process transactions and performing resource cleanup? I wonder if anyone
> > has attempted such a strategy or if there have been previous discussions
> > on this topic.
>
> The reason we force a database-wide restart is that there's no way to
> be certain that the crashed process didn't corrupt anything in shared
> memory. (Even with the forced restart, there's a window where bad
> data could reach disk before we kill off the other processes that
> might write it. But at least it's a short window.) "Corruption"
> here doesn't just involve bad data placed into disk buffers; more
> often it's things like unreleased locks, which would block other
> processes indefinitely.
>
> I seriously doubt that anything like what you're describing
> could be made reliable enough to be acceptable. "Oracle does
> it like this" isn't a counter-argument: they have a much different
> (and non-extensible) architecture, and they also have an army of
> programmers to deal with minutiae like undoing resource acquisition.
> Even with that, you'd have to wonder about the number of bugs
> existing in such necessarily-poorly-tested code paths.

Yes.
I think that PostgreSQL's approach is superior: rather than investing in
code to mitigate the impact of data corruption caused by a crash, invest
in quality code that doesn't crash in the first place.

Euphemistically naming a crash an "ORA-600 error" seems to be part of
Oracle's strategy.

Yours,
Laurenz Albe
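
As an illustration of the unreleased-lock problem Tom describes in the quoted text, here is a minimal Python sketch. It is not PostgreSQL code: the one-byte lock flag, the helper names try_acquire and crash_while_holding_lock, and the timeout values are all hypothetical, and it assumes a Unix-like system where os.fork is available. A child process marks a lock in shared memory as held and then exits abruptly; the surviving process cannot obtain the lock until the shared state is reinitialized.

```python
import mmap
import os
import time

LOCK_FREE, LOCK_HELD = 0, 1


def try_acquire(shm, timeout=0.2):
    """Poll the shared lock byte until it is free or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if shm[0] == LOCK_FREE:
            # Not atomic; acceptable only because this demo has one waiter.
            shm[0] = LOCK_HELD
            return True
        time.sleep(0.01)
    return False


def crash_while_holding_lock():
    """Fork a child that takes the lock and dies without releasing it."""
    shm = mmap.mmap(-1, 1)  # anonymous shared mapping, inherited across fork
    shm[0] = LOCK_FREE
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the backend that crashes mid-operation.
        shm[0] = LOCK_HELD
        os._exit(1)  # abrupt exit, no cleanup code runs
    _, status = os.waitpid(pid, 0)
    return shm, status
```

After crash_while_holding_lock() returns, try_acquire(shm) returns False, because the flag stays set even though its owner is gone. Only resetting the shared byte lets the waiter proceed, which is roughly what a full restart with freshly initialized shared memory accomplishes.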

In response to

Re: How to solve the problem of one backend process crashing and causing other processes to restart? at 2023-11-13 02:55:48 from Tom Lane

Responses

Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart? at 2023-11-13 09:13:20 from yuansong
Re: How to solve the problem of one backend process crashing and causing other processes to restart? at 2023-11-13 16:57:56 from Joe Conway
