Re: strange parallel query behavior after OOM crashes

From: Neha Khatri <nehakhatri5(at)gmail(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-03-31 00:13:17
Message-ID: CAFO0U+874hTAooRdPgvE7f0bPc-QfUTywLS1baM8cMp-tSjvTw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
wrote:

> On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
> <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> >
> > 1. Put an Assert(0) in ParallelQueryMain(), start server and execute
> > any parallel query.
> > In LaunchParallelWorkers, you can see
> > nworkers = n nworkers_launched = n (n>0)
> > But, all the workers will crash because of the assert statement.
> > 2. the server restarts automatically, initialize
> > BackgroundWorkerData->parallel_register_count and
> > BackgroundWorkerData->parallel_terminate_count in the shared memory.
> > After that, it calls ForgetBackgroundWorker and it increments
> > parallel_terminate_count. In LaunchParallelWorkers, we have the
> > following condition:
> > if ((BackgroundWorkerData->parallel_register_count -
> > BackgroundWorkerData->parallel_terminate_count) >=
> > max_parallel_workers)
> > DO NOT launch any parallel worker.
> > Hence, nworkers = n nworkers_launched = 0.
> parallel_register_count and parallel_terminate_count, both are
> unsigned integer. So, whenever the difference is negative, it'll be a
> well-defined unsigned integer and certainly much larger than
> max_parallel_workers. Hence, no workers will be launched. I've
> attached a patch to fix this.

The current explanation of the number of active parallel workers is:

* The active
* number of parallel workers is the number of registered workers minus the
* terminated ones.

In situations like the one you described above, this formula can yield a
negative number of active parallel workers. However, a negative number of
active parallel workers does not make any sense.

I feel it would be better to explain in the code in which situations the
formula can produce a negative result, and what that means.
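
To make the wraparound in the quoted analysis concrete, here is a minimal
standalone sketch. It is not the server code: the WorkerCounts struct, both
helper functions, the 32-bit counter width, and the unsigned type of the limit
parameter are assumptions made for illustration. Only the two counter names
and the shape of the comparison come from the quoted text.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two shared-memory counters discussed above.
 * The quoted text only says they are unsigned; 32 bits is assumed here.
 */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} WorkerCounts;

/*
 * "Registered minus terminated", computed in unsigned arithmetic.
 * If terminate > register, the result wraps around to a huge value
 * instead of going negative.
 */
static uint32_t
active_parallel_workers(const WorkerCounts *c)
{
    return c->parallel_register_count - c->parallel_terminate_count;
}

/*
 * Mirrors the shape of the quoted condition: no worker is launched once
 * the difference reaches the limit (hypothetical helper, unsigned limit).
 */
static bool
can_launch_worker(const WorkerCounts *c, uint32_t max_parallel_workers)
{
    return active_parallel_workers(c) < max_parallel_workers;
}
```

With counters of 0 registered and 1 terminated, as in the restart scenario
described above, active_parallel_workers() returns UINT32_MAX, so
can_launch_worker() is false for any realistic limit.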

Regards,
Neha
