Re: Hang in pldebugger after git commit : 98a64d0

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Hang in pldebugger after git commit : 98a64d0
Date: 2016-12-09 06:23:49
Message-ID: CAB7nPqSs-1FVUG7he-Lp8M0j8HdkxD-W4CzxzuPemprKZZRh8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 9, 2016 at 2:38 AM, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> Well, we are currently passing INFINITE timeout value to
> WaitForMultipleObjects() API which is hanging. I thought of changing this
> INIFINTE timeout interval to some finite value like say 1 sec and could not
> reproduce the issue. But, having said that i checked the older code where
> this issue does not exists and found here as well we are passing INFINTE
> timeout value to WaitForMultipleObjects(). So not sure if this passing
> INFINITE timeout is really an issue. Attached is a small patch that has my
> changes.

That's obviously incorrect to me. This should not wait for a timeout
if the caller does not want to. So you are likely breaking a bunch of
code by doing so, including many extensions on Windows. The
pre-WaitEventSet code uses that:
- if (wakeEvents & WL_TIMEOUT)
- {
- INSTR_TIME_SET_CURRENT(start_time);
- Assert(timeout >= 0 && timeout <= INT_MAX);
- cur_timeout = timeout;
- }
- else
- cur_timeout = INFINITE;
INFINITE maps to -1 by looking at the MS docs, and that's as well what
the new code does so the inconsistency is not there. And the new code
does not bother about setting INFINITE and just uses -1.

I have tried to compile pldebugger with MSVC but gave up at the end,
so I am falling back to some code review for the time being. Andres
mentioned me that this Windows code was in need of an extra lookup.
And from what I can see, the logic that we have before 98a64d0 is a
set of handles using WaitForMultipleObjects with a pre-allocated
position in the handle array. The new code made things more generic by
allocating the events depending on what the user has set up.
pgwin32_signal_event is correctly using the first handle, and other
events registered correctly adapt to this position.

I have spent some time reviewing the code, and I think that I have
spotted one problem. The event set that the code you are triggering is
waiting for is FeBeWaitSet, which gets initialized in
pq_init()@pqcomm.c. 3 events are being set there. This goes through
CreateWaitEventSet, that sets up pgwin32_signal_event as first handle.
So far so good. Then a bunch of events are added with
AddWaitEventToSet. Which is fine as well as the handling of handles
with WSA_INVALID_EVENT looks correct, because its value is NULL and
there is a static assertion to check things.

One place in the code has spotted my attention:
+#elif defined(WAIT_USE_WIN32)
+ set->handles = (HANDLE) data;
+ data += sizeof(HANDLE) * nevents;
+#endif
This should be actually (nevents + 1) to take into account
pgwin32_signal_event. That's harmless now but it would if new fields
are added to WaitEventSet in the future.

A second thing that I am noticing is that in the new code:
+ * Note: we assume that the kernel calls involved in latch management
+ * will provide adequate synchronization on machines with weak memory
+ * ordering, so that we cannot miss seeing is_set if a notification
+ * has already been queued.
+ */
+ if (set->latch && set->latch->is_set)
+ {
+ occurred_events->fd = PGINVALID_SOCKET;
+ occurred_events->pos = set->latch_pos;
+ occurred_events->user_data =
+ set->events[set->latch_pos].user_data;
+ occurred_events->events = WL_LATCH_SET;
+ occurred_events++;
+ returned_events++;
+
+ break;
+ }
This basically means that if the latch is set, we don't wait at all
and drop the ball. I am wondering: isn't that a problem even if
WL_LATCH_SET is *not* set? If I read this code correctly, even if
caller has not set WL_LATCH_SET and the latch is set, then the wait
will stop.

Still reviewing the code...
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2016-12-09 07:02:23 Re: Transactions involving multiple postgres foreign servers
Previous Message Venkata B Nagothi 2016-12-09 06:16:12 Re: Declarative partitioning - another take