Re: Instability in select_parallel regression test

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Instability in select_parallel regression test
Date: 2017-02-19 13:20:59
Message-ID: CAA4eK1JdQ=dsfYpGZCF0a1zgTebSujYgHguAmv2deGGVkirn3w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 19, 2017 at 5:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Feb 19, 2017 at 2:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Such a change can be made, but as I pointed out in the part you didn't
>> quote, there are reasons to wonder whether that will be a constructive
>> change in real life even if it's better for the regression tests.
>> Optimizing PostgreSQL for the use case of running regression tests in
>> the buildfarm at the expense of other use cases wouldn't be very
>> smart. Maybe such a change is better in real-world applications too,
>> but that deserves at least a little bit of thought and substantive
>> discussion.
>
> Rewind. Wait a minute. Looking at this code again, it looks like
> we're supposed to ALREADY BE DOING THIS.
>
> DestroyParallelContext() calls WaitForParallelWorkersToExit() which
> calls WaitForBackgroundWorkerShutdown() for each worker. That
> function returns only when the postmaster dies (which causes an error
> with that specific complaint) or when GetBackgroundWorkerPid() sets
> the status to BGWH_STOPPED. GetBackgroundWorkerPid() only returns
> BGWH_STOPPED when either (a) handle->generation != slot->generation
> (meaning that the slot got reused, and therefore must have been freed)
> or when (b) slot->pid == 0. The pid only gets set to 0 in
> BackgroundWorkerStateChange() when slot->terminate is set, or in
> ReportBackgroundWorkerPID() when it's called from
> CleanupBackgroundWorker. So this function should not be returning
> until after all workers have actually exited.
>

Yeah, I have also noticed this point and was thinking of a way to
close this gap.

> However, it looks like there's a race condition here, because the slot
> doesn't get freed up at the same time that the PID gets set to 0.
> That actually happens later, when the postmaster calls
> maybe_start_bgworker() or DetermineSleepTime() and one of those
> functions calls ForgetBackgroundWorker(). We could tighten this up by
> changing CleanupBackgroundWorker() to also call
> ForgetBackgroundWorker() immediately after calling
> ReportBackgroundWorkerPID() if rw->rw_terminate ||
> rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART. If we do that
> BEFORE sending the notification to the starting process, that closes
> this hole. Almost.
>

To close the remaining gap, don't you think we can check the
slot->in_use flag when the generation numbers for the handle and the
slot are the same?
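
A rough sketch of that check, written as a standalone model rather than
as actual bgworker.c code. The type and function names (ModelSlot,
ModelHandle, model_get_worker_status, INVALID_PID) are made up for
illustration; only the in_use, generation, and pid fields are meant to
mirror the slot fields discussed above. The idea is that a slot which
is no longer in use is reported as stopped even when its generation
still matches the handle:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real status codes and structures. */
typedef enum
{
    WORKER_NOT_YET_STARTED,
    WORKER_STARTED,
    WORKER_STOPPED
} WorkerStatus;

#define INVALID_PID (-1)        /* assumed marker for "not started yet" */

typedef struct
{
    bool          in_use;       /* slot currently holds a registered worker */
    unsigned long generation;   /* bumped each time the slot is reused */
    int           pid;          /* 0 = exited, INVALID_PID = not yet started */
} ModelSlot;

typedef struct
{
    unsigned long generation;   /* generation seen at registration time */
} ModelHandle;

static WorkerStatus
model_get_worker_status(const ModelHandle *handle, const ModelSlot *slot)
{
    int pid;

    assert(handle != NULL && slot != NULL);

    /*
     * Treat the worker as gone if the slot was reused (generation
     * mismatch) or, with the proposed extra check, if the slot has been
     * freed but not yet reused (generation matches, in_use is false).
     */
    if (handle->generation != slot->generation || !slot->in_use)
        pid = 0;
    else
        pid = slot->pid;

    if (pid == 0)
        return WORKER_STOPPED;
    if (pid == INVALID_PID)
        return WORKER_NOT_YET_STARTED;
    return WORKER_STARTED;
}
```

In this model, a slot that has been freed but still carries the old
generation and a stale nonzero pid is reported as stopped, which is the
case the generation comparison alone would miss.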

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
