Re: strange parallel query behavior after OOM crashes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-04-04 17:52:29
Message-ID: 73b84052-5dfe-0c40-4212-cfa96a1c6027@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/04/2017 06:52 PM, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>> <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>>>> 2. the server restarts automatically, initialize
>>>> BackgroundWorkerData->parallel_register_count and
>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>> parallel_terminate_count.
>>>
>>> Hmm. So this seems like the root of the problem. Presumably those
>>> things need to be reset AFTER forgetting any background workers from
>>> before the crash.
>>>
>> IMHO, the fix would be not to increase the terminated parallel worker
>> count whenever ForgetBackgroundWorker is called due to a bgworker
>> crash. I've attached a patch for the same. PFA.
>
> While I'm not opposed to that approach, I don't think this is a good
> way to implement it. If you want to pass an explicit flag to
> ForgetBackgroundWorker telling it whether or not it should performing
> the increment, fine. But with what you've got here, you're
> essentially relying on "spooky action at a distance". It would be
> easy for future code changes to break this, not realizing that
> somebody's got a hard dependency on 0 having a specific meaning.
>

I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM
followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply
reset the values back to 0? Or do we call ForgetBackgroundWorker() after
the crash for some reason?

In any case, the comment right before BackgroundWorkerArray says this:

* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.

which means that this assert

+ Assert(BackgroundWorkerData->parallel_register_count >=
+ BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?

> BTW, if this isn't on the open items list, it should be. It's
> presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.
>

Unrelated, but perhaps we should also add this to open items:

https://www.postgresql.org/message-id/flat/57bbc52c-5d40-bb80-2992-7e1027d0b67c%402ndquadrant.com

(although that's more a 9.6 material).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-04-04 17:54:19 Re: Compiler warning in costsize.c
Previous Message Pavan Deolasee 2017-04-04 17:48:44 Re: Patch: Write Amplification Reduction Method (WARM)