Re: Broken order-of-operations in parallel query latch manipulation

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Broken order-of-operations in parallel query latch manipulation
Date:	2016-08-01 15:15:23
Message-ID:	6446.1470064523@sss.pgh.pa.us
Lists:	pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> On Mon, Aug 1, 2016 at 1:58 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I believe this is wrong and the CHECK_FOR_INTERRUPTS needs to be before
>> or after the two latch operations. As-is, if the reason somebody set
>> our latch was to get us to notice that a CHECK_FOR_INTERRUPTS condition
>> happened, there's a race condition where we'd fail to realize that.

> I could see that in nodeGather.c, it might fail to notice the SetLatch
> done by worker process or spuriously woken up due to SetLatch for some
> unrelated reason. However, I don't see what problem it can cause
> apart from one extra loop cycle where it will try to process the tuple
> when actually there is no tuple in the queue.

Consider the following sequence of events:

1. gather_readnext reaches the WaitLatch, and is allowed to pass through
it for some unrelated reason, perhaps some long-since-handled SIGUSR1
from a worker process.

2. gather_readnext does CHECK_FOR_INTERRUPTS(), and sees nothing pending.

3. A SIGINT is received. StatementCancelHandler sets QueryCancelPending
and does SetLatch(MyLatch).

4. gather_readnext does ResetLatch(MyLatch).

5. gather_readnext runs through its loop again, finds nothing to do, and
reaches the WaitLatch.

6. The process is now sleeping on its latch, and might sit there a long
time before noticing the pending query cancel.
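The sequence above can be replayed in a small deterministic model. This is only a sketch: ToyProc and every toy_* function are hypothetical stand-ins for MyLatch, QueryCancelPending, the signal handler, CHECK_FOR_INTERRUPTS and ResetLatch, not PostgreSQL code. The model starts just after WaitLatch has returned (step 1) and reports whether the process would reach the next WaitLatch with a cancel pending, unnoticed, and the latch clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in PostgreSQL. */
typedef struct {
    bool latch_set;       /* models MyLatch being set */
    bool cancel_pending;  /* models QueryCancelPending */
    bool cancel_seen;     /* true once the interrupt check noticed the cancel */
} ToyProc;

/* Models the signal handler: flag the cancel, then set the latch. */
static void toy_sigint(ToyProc *p) { p->cancel_pending = true; p->latch_set = true; }

/* Models CHECK_FOR_INTERRUPTS(): notice a pending cancel, if any. */
static void toy_check(ToyProc *p) { if (p->cancel_pending) p->cancel_seen = true; }

/* Models ResetLatch(MyLatch). */
static void toy_reset(ToyProc *p) { p->latch_set = false; }

/*
 * signal_at: 0 = before both operations, 1 = between them, 2 = after both.
 * Returns true if the next wait would block with an unnoticed cancel.
 */
static bool toy_cancel_lost(bool check_before_reset, int signal_at)
{
    assert(signal_at >= 0 && signal_at <= 2);

    /* Step 1: the wait was passed because the latch was already set. */
    ToyProc p = { .latch_set = true, .cancel_pending = false, .cancel_seen = false };

    if (signal_at == 0) toy_sigint(&p);
    if (check_before_reset) toy_check(&p); else toy_reset(&p);
    if (signal_at == 1) toy_sigint(&p);
    if (check_before_reset) toy_reset(&p); else toy_check(&p);
    if (signal_at == 2) toy_sigint(&p);

    return p.cancel_pending && !p.cancel_seen && !p.latch_set;
}
```

In this model, only the check-then-reset order with the signal landing between the two operations loses the cancel. With reset-then-check, every placement of the signal is either noticed by the check or leaves the latch set for the next wait.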

Obviously the window for this race condition is pretty tight: there
aren't many instructions between steps 2 and 4. But it can happen. If
memory serves, we've had actual field reports for race condition bugs
where the window that was being hit wasn't much more than a single
instruction.

Also, it's entirely possible that the bug could be masked, if there was
another CHECK_FOR_INTERRUPTS lurking anywhere in the code called within
the loop. That doesn't excuse this coding practice, though.

BTW, now that I look at it, CHECK_FOR_INTERRUPTS subsumes
HandleParallelMessages(), which means the direct call to the latter
at the top of gather_readnext's loop is pretty bogus. I now think
the right fix in gather_readnext is to move the CHECK_FOR_INTERRUPTS
macro to the top of the loop, replacing that call. The places in
shm_mq.c that have this issue should probably look like
ProcWaitForSignal, though.
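As a rough illustration of the loop shape proposed here (interrupt check at the top of the loop, then the wait, then the reset), here is a toy model that fires one simulated SIGINT at different points in the first iteration. ToyState, toy_fire and toy_fixed_loop are hypothetical names; the real loop would use CHECK_FOR_INTERRUPTS(), WaitLatch() and ResetLatch(), and its exact layout is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these names exist in PostgreSQL. */
typedef struct {
    bool latch_set;       /* models MyLatch being set */
    bool cancel_pending;  /* models QueryCancelPending */
} ToyState;

/* Points within one loop iteration where the toy SIGINT may fire. */
enum ToyFirePoint { TOY_BEFORE_WAIT, TOY_AFTER_WAIT, TOY_AFTER_RESET };

/* Models the signal handler: flag the cancel, then set the latch. */
static void toy_fire(ToyState *s)
{
    s->cancel_pending = true;
    s->latch_set = true;
}

/*
 * Runs the loop in the proposed shape with one SIGINT during iteration 1.
 * Returns the iteration whose top-of-loop check notices the cancel,
 * or 0 if the loop would go to sleep without having noticed it.
 */
static int toy_fixed_loop(enum ToyFirePoint fire_at)
{
    assert(fire_at >= TOY_BEFORE_WAIT && fire_at <= TOY_AFTER_RESET);

    /* The latch starts out set for some unrelated, already-handled reason. */
    ToyState s = { .latch_set = true, .cancel_pending = false };

    for (int iter = 1; iter <= 3; iter++) {
        bool first = (iter == 1);

        if (s.cancel_pending)      /* models CHECK_FOR_INTERRUPTS() at loop top */
            return iter;

        /* ... look for a tuple; this toy model never finds one ... */

        if (first && fire_at == TOY_BEFORE_WAIT) toy_fire(&s);
        if (!s.latch_set)          /* models WaitLatch(): would block here */
            return 0;
        if (first && fire_at == TOY_AFTER_WAIT) toy_fire(&s);
        s.latch_set = false;       /* models ResetLatch() */
        if (first && fire_at == TOY_AFTER_RESET) toy_fire(&s);
    }
    return 0;
}
```

Wherever the signal lands in the first iteration, the check at the top of the second iteration sees it before the loop can block again, because no interrupt check sits between the wait and the reset.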

regards, tom lane

In response to

Re: Broken order-of-operations in parallel query latch manipulation at 2016-08-01 04:07:25 from Amit Kapila

Responses

Re: Broken order-of-operations in parallel query latch manipulation at 2016-08-02 10:21:05 from Amit Kapila
