Re: strange parallel query behavior after OOM crashes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-04-05 10:43:05
Message-ID: da03d726-e87d-ef5e-396d-188549e6cc23@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/05/2017 12:36 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>
>> On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
>>>
>>> AFAICU, during crash recovery, we wait for all non-syslogger children
>>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
>>> StartupDataBase. While starting the startup process we check if any
>>> bgworker is scheduled for a restart. If a bgworker is crashed and not
>>> meant for a restart(parallel worker in our case), we call
>>> ForgetBackgroundWorker() to discard it.
>>>
>>
>> OK, so essentially we reset the counters, getting
>>
>> parallel_register_count = 0
>> parallel_terminate_count = 0
>>
>> and then ForgetBackgroundWworker() runs and increments the terminate_count,
>> breaking the logic?
>>
>> And you're using (parallel_register_count > 0) to detect whether we're still
>> in the init phase or not. Hmmm.
>>
> Yes. But, as Robert suggested up in the thread, we should not use
> (parallel_register_count = 0) as an alternative to define a bgworker
> crash. Hence, I've added an argument named 'wasCrashed' in
> ForgetBackgroundWorker to indicate a bgworker crash.
>

Sure, and I agree that having an explicit flag is the right solution.
I'm just trying to make sure I understand what's happening.

>>
>> The comment says that the counters are allowed to overflow, i.e. after a
>> long uptime you might get these values
>>
>> parallel_register_count = UINT_MAX + 1 = 1
>> parallel_terminate_count = UINT_MAX
>>
>> which is fine, because the C handles the overflow during subtraction and so
>>
>> parallel_register_count - parallel_terminate_count = 1
>>
>> But the assert is not doing subtraction, it's comparing the values directly:
>>
>> Assert(parallel_register_count >= parallel_terminate_count);
>>
>> and the (perfectly valid) values trivially violate this comparison.
>>
> Thanks for the explanation. So, we can't use the above assert
> statement. Even the following assert statement will not be helpful:
> Assert(parallel_register_count - parallel_terminate_count >= 0);
> Because, it'll fail to track the scenario when parallel_register_count
> is not overflowed, still less than parallel_terminate_count. :(
>

Actually, that assert would work, because C does handle overflows on
uint values during subtraction just fine. That is,

(UINT_MAX+x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims
this is the case.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2017-04-05 11:09:40 Re: strange parallel query behavior after OOM crashes
Previous Message Kuntal Ghosh 2017-04-05 10:36:40 Re: strange parallel query behavior after OOM crashes