Re: strange parallel query behavior after OOM crashes

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-04-10 03:18:36
Message-ID: 20170410031836.GC2845004@tornado.leadboat.com
Lists: pgsql-hackers

On Thu, Apr 06, 2017 at 03:04:13PM +0530, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 6:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
> > <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> >> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> >>> I'm probably missing something, but I don't quite understand how these
> >>> values actually survive the crash. I mean, what I observed is OOM followed
> >>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
> >>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
> >>> some reason?
> >> AFAICU, during crash recovery, we wait for all non-syslogger children
> >> to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
> >> StartupDataBase. While starting the startup process, we check whether any
> >> bgworker is scheduled for a restart.
> >>
> >
> > In general, your theory appears right, but can you check how it
> > behaves in standby server because there is a difference in how the
> > startup process behaves during master and standby startup? In master,
> > it stops after recovery whereas in standby it will keep on running to
> > receive WAL.
> >
> While performing StartupDataBase, both master and standby servers
> behave in a similar way until postmaster spawns the startup process.
> In master, the startup process completes its job and dies. As a result,
> reaper is called, which in turn calls maybe_start_bgworker().
> In standby, after getting a valid snapshot, the startup process sends
> postmaster a signal to enable connections. The signal handler in
> postmaster calls maybe_start_bgworker().
> In maybe_start_bgworker(), if we find a crashed bgworker (crashed_at !=
> 0) with a NEVER RESTART flag, we call ForgetBackgroundWorker() to
> forget the bgworker process.
>
> I've attached a patch that adds an argument to
> ForgetBackgroundWorker() to indicate a crashed situation. Based on
> that, we can take the necessary actions. I've not included the Assert
> statement in this patch.
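
The sequence described in the quoted text can be sketched as a small standalone C model. This is not PostgreSQL source. WorkerSlot, SharedCounters, forget_worker(), maybe_start_workers() and the after_crash parameter are hypothetical stand-ins, and the two counters are an assumed model of the shared-memory "values" discussed above. The sketch also assumes that the "necessary action" is to skip the terminate-count increment after a crash; the attached patch may do something different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 4

/* Hypothetical stand-in for one registered background worker. */
typedef struct WorkerSlot
{
    bool    in_use;
    bool    is_parallel;    /* counted in the shared counters below */
    bool    never_restart;  /* models the "never restart" setting */
    int64_t crashed_at;     /* 0 means the worker has not crashed */
} WorkerSlot;

/* Hypothetical stand-in for counters kept in shared memory. */
typedef struct SharedCounters
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} SharedCounters;

/* Models the shmem reset during crash recovery: counters go back to 0. */
static void
shmem_init(SharedCounters *c)
{
    c->parallel_register_count = 0;
    c->parallel_terminate_count = 0;
}

/* Registers a worker in the first free slot; returns its index or -1. */
static int
register_worker(SharedCounters *c, WorkerSlot *slots,
                bool is_parallel, bool never_restart)
{
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (!slots[i].in_use)
        {
            slots[i].in_use = true;
            slots[i].is_parallel = is_parallel;
            slots[i].never_restart = never_restart;
            slots[i].crashed_at = 0;
            if (is_parallel)
                c->parallel_register_count++;
            return i;
        }
    }
    return -1;
}

/*
 * Forgets a worker.  The after_crash flag is the assumed extra argument:
 * when set, the terminate counter is left alone because the matching
 * registration was already wiped by shmem_init().
 */
static void
forget_worker(SharedCounters *c, WorkerSlot *slot, bool after_crash)
{
    if (slot->is_parallel && !after_crash)
        c->parallel_terminate_count++;
    slot->in_use = false;
}

/*
 * Models the scan in which crashed, never-restart workers are forgotten.
 * crash_aware selects whether the after_crash flag is passed along.
 * Returns the number of workers forgotten.
 */
static int
maybe_start_workers(SharedCounters *c, WorkerSlot *slots, bool crash_aware)
{
    int forgotten = 0;

    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (slots[i].in_use && slots[i].crashed_at != 0 &&
            slots[i].never_restart)
        {
            forget_worker(c, &slots[i], crash_aware);
            forgotten++;
        }
    }
    return forgotten;
}

/* Unsigned difference; wraps around if terminate exceeds register. */
static uint32_t
active_parallel_workers(const SharedCounters *c)
{
    return c->parallel_register_count - c->parallel_terminate_count;
}
```

In this model, calling maybe_start_workers() with crash_aware set to false after a shmem_init() leaves the terminate count ahead of the register count, so the unsigned difference wraps around to a very large value. With crash_aware set to true, both counters stay at 0.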

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
