Re: strange parallel query behavior after OOM crashes

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-03-30 20:35:31
Message-ID: CAGz5QCL6h-cZS9v=yrbd3FZDDGpXdyMw4icgbx3eE6F2P_eOVA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 31, 2017 at 12:32 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Fri, Mar 31, 2017 at 7:38 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> While doing some benchmarking, I've run into a fairly strange issue with OOM
>> breaking LaunchParallelWorkers() after the restart. What I see happening is
>> this:
>>
>> 1) a query is executed, and at the end of LaunchParallelWorkers we get
>>
>> nworkers=8 nworkers_launched=8
>>
>> 2) the query does a Hash Aggregate, but ends up eating much more memory due
>> to n_distinct underestimate (see [1] from 2015 for details), and gets killed
>> by OOM
>>
>> 3) the server restarts, the query is executed again, but this time we get in
>> LaunchParallelWorkers
>>
>> nworkers=8 nworkers_launched=0
>>
>> There's nothing else running on the server, and there definitely should be
>> free parallel workers.
>>
>> 4) The query gets killed again, and on the next execution we get
>>
>> nworkers=8 nworkers_launched=8
>>
>> again, although not always. I wonder whether the exact impact depends on OOM
>> killing the leader or worker, for example.
>
> I don't know what's going on but I think I have seen this once or
> twice myself while hacking on test code that crashed. I wonder if the
> DSM_CREATE_NULL_IF_MAXSEGMENTS case could be being triggered because
> the DSM control is somehow confused?
>
I think I've run into the same problem while working on parallelizing
plans containing InitPlans. You can reproduce the scenario with the
following steps:

1. Put an Assert(0) in ParallelQueryMain(), start the server, and
execute any parallel query. In LaunchParallelWorkers, you can see

    nworkers = n nworkers_launched = n (n > 0)

But all the workers will crash because of the Assert.

2. The server restarts automatically and initializes
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in shared memory.
After that, it calls ForgetBackgroundWorker, which increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:

    if ((BackgroundWorkerData->parallel_register_count -
         BackgroundWorkerData->parallel_terminate_count) >=
        max_parallel_workers)
        DO NOT launch any parallel worker.

Hence, nworkers = n nworkers_launched = 0.
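
To make the arithmetic in step 2 concrete, here is a minimal
standalone sketch. It assumes the two counters are unsigned 32-bit
integers (I believe that matches bgworker.c, but treat it as an
assumption here). The struct and function names are hypothetical
stand-ins, not the actual PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters in BackgroundWorkerData. */
typedef struct WorkerCounters
{
    uint32_t parallel_register_count;   /* assumed unsigned 32-bit */
    uint32_t parallel_terminate_count;  /* assumed unsigned 32-bit */
} WorkerCounters;

/*
 * Mirrors the condition quoted above. Returns true when the check
 * would refuse to launch any parallel worker.
 */
static bool
worker_limit_reached(const WorkerCounters *c, int max_parallel_workers)
{
    uint32_t active;

    assert(max_parallel_workers >= 0);

    /* Unsigned subtraction wraps around if terminate > register. */
    active = (uint32_t) (c->parallel_register_count -
                         c->parallel_terminate_count);
    return active >= (uint32_t) max_parallel_workers;
}

/*
 * Models step 2: both counters start at zero after the restart, and
 * then each crashed worker that is "forgotten" bumps only the
 * terminate count.
 */
static WorkerCounters
counters_after_crash_restart(uint32_t forgotten_workers)
{
    WorkerCounters c = {0, 0};

    c.parallel_terminate_count += forgotten_workers;
    return c;
}
```

With 8 forgotten workers, the difference 0 - 8 wraps to a value near
the top of the 32-bit range, so the comparison against
max_parallel_workers succeeds and no workers are launched, even
though none are actually running.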

I assumed this was expected, since the parallel workers were crashing
because of my own mistake. Sorry for that.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
