Re: strange parallel query behavior after OOM crashes

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-04-06 09:34:13
Message-ID: CAGz5QCLP9kdDHk=zBUs-5+V1AGARXPFOi=AA1Z9JxRpnt0rmqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 5, 2017 at 6:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
> <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
>>> I'm probably missing something, but I don't quite understand how these
>>> values actually survive the crash. I mean, what I observed is OOM followed
>>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>>> some reason?
>> AFAICU, during crash recovery, we wait for all non-syslogger children
>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
>> StartupDataBase. While starting the startup process we check if any
>> bgworker is scheduled for a restart.
>>
>
> In general, your theory appears right, but can you check how it
> behaves in standby server because there is a difference in how the
> startup process behaves during master and standby startup? In master,
> it stops after recovery whereas in standby it will keep on running to
> receive WAL.
>
While performing StartupDatabase, both master and standby server
behave in similar way till postmaster spawns startup process.
In master, startup process completes its job and dies. As a result,
reaper is called which in turn calls maybe_start_bgworker().
In standby, after getting a valid snapshot, startup process sends
postmaster a signal to enable connections. Signal handler in
postmaster calls maybe_start_bgworker().
In maybe_start_bgworker(), if we find a crashed bgworker(crashed_at !=
0) with a NEVER RESTART flag, we call ForgetBackgroundWorker().to
forget the bgworker process.

I've attached the patch for adding an argument in
ForgetBackgroundWorker() to indicate a crashed situation. Based on
that, we can take the necessary actions. I've not included the Assert
statement in this patch.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0001-Fix-parallel-worker-counts-after-a-crash_v1.patch binary/octet-stream 3.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-04-06 09:55:43 Re: [COMMITTERS] pgsql: Collect and use multi-column dependency stats
Previous Message Ashutosh Bapat 2017-04-06 09:05:52 No-op case in ExecEvalConvertRowtype