Re: Broken order-of-operations in parallel query latch manipulation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Broken order-of-operations in parallel query latch manipulation
Date: 2016-08-02 10:21:05
Message-ID: CAA4eK1JNK+hX77LTRw-_Z7-kALN1wfe1nzO6PaAOo4JQKdUzYA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 1, 2016 at 8:45 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>> On Mon, Aug 1, 2016 at 1:58 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I believe this is wrong and the CHECK_FOR_INTERRUPTS needs to be before
>>> or after the two latch operations. As-is, if the reason somebody set
>>> our latch was to get us to notice that a CHECK_FOR_INTERRUPTS condition
>>> happened, there's a race condition where we'd fail to realize that.
>
>> I could see that in nodeGather.c, it might fail to notice the SetLatch
>> done by a worker process, or be spuriously woken up by a SetLatch done
>> for some unrelated reason. However, I don't see what problem that can
>> cause apart from one extra loop cycle where it will try to process a
>> tuple when actually there is no tuple in the queue.
>
> Consider the following sequence of events:
>
> 1. gather_readnext reaches the WaitLatch, and is allowed to pass through
> it for some unrelated reason, perhaps some long-since-handled SIGUSR1
> from a worker process.
>
> 2. gather_readnext does CHECK_FOR_INTERRUPTS(), and sees nothing pending.
>
> 3. A SIGINT is received. StatementCancelHandler sets QueryCancelPending
> and does SetLatch(MyLatch).
>
> 4. gather_readnext does ResetLatch(MyLatch).
>
> 5. gather_readnext runs through its loop again, finds nothing to do, and
> reaches the WaitLatch.
>
> 6. The process is now sleeping on its latch, and might sit there a long
> time before noticing the pending query cancel.
>
> Obviously the window for this race condition is pretty tight --- there's
> not many instructions between steps 2 and 4. But it can happen. If
> memory serves, we've had actual field reports for race condition bugs
> where the window that was being hit wasn't much more than a single
> instruction.
>
> Also, it's entirely possible that the bug could be masked, if there was
> another CHECK_FOR_INTERRUPTS lurking anywhere in the code called within
> the loop. That doesn't excuse this coding practice, though.
>

Right, and thanks for the detailed explanation.
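
To make the sequence above concrete, here is a minimal sketch that replays it deterministically. It is only a toy model, not PostgreSQL code: the two booleans stand in for MyLatch and QueryCancelPending, and every function name in it is hypothetical. It assumes that CHECK_FOR_INTERRUPTS can be modelled as reading the pending flag and that "sleeping in WaitLatch" means reaching the wait with the latch not set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for MyLatch and QueryCancelPending. */
static bool latch_is_set;
static bool cancel_pending;

/* Models StatementCancelHandler (step 3): set the flag, then the latch. */
static void cancel_handler(void)
{
    cancel_pending = true;
    latch_is_set = true;
}

/* Models WaitLatch returning; it can only return once the latch is set. */
static void wait_latch_returns(void)
{
    assert(latch_is_set);
}

/* Models ResetLatch. */
static void reset_latch(void)
{
    latch_is_set = false;
}

/* Start each replay at step 1: latch set by a stale, unrelated wakeup. */
static void start_replay(void)
{
    latch_is_set = true;
    cancel_pending = false;
}

/* True if the next WaitLatch would sleep with the cancel still unnoticed. */
static bool cancel_is_lost(bool noticed)
{
    return cancel_pending && !noticed && !latch_is_set;
}

/* Order discussed above: WaitLatch, CHECK_FOR_INTERRUPTS, ResetLatch. */
static bool broken_order_loses_cancel(void)
{
    start_replay();
    wait_latch_returns();            /* step 1 */
    bool noticed = cancel_pending;   /* step 2: nothing pending yet */
    cancel_handler();                /* step 3: SIGINT arrives */
    reset_latch();                   /* step 4: wipes out the wakeup */
    return cancel_is_lost(noticed);  /* steps 5 and 6 */
}

/*
 * Reordered: WaitLatch, ResetLatch, CHECK_FOR_INTERRUPTS.
 * signal_at picks where the SIGINT lands: 0 = before ResetLatch,
 * 1 = between ResetLatch and the check, 2 = after the check.
 */
static bool reordered_loses_cancel(int signal_at)
{
    start_replay();
    wait_latch_returns();
    if (signal_at == 0)
        cancel_handler();
    reset_latch();
    if (signal_at == 1)
        cancel_handler();
    bool noticed = cancel_pending;   /* the interrupt check */
    if (signal_at == 2)
        cancel_handler();
    return cancel_is_lost(noticed);
}
```

In this model, broken_order_loses_cancel() returns true, while reordered_loses_cancel() returns false for all three signal positions: either the check sees the flag, or the latch is still set and the next wait falls straight through.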

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
