Re: Windows buildfarm members vs. new async-notify isolation test

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Windows buildfarm members vs. new async-notify isolation test
Date: 2019-12-08 05:14:32
Message-ID: CAA4eK1KpRMRJG0krbiL8sUA9wZTVwvoHejEkJK2sVH2idG-rSQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 8, 2019 at 1:26 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> So, just idly looking at the code in src/backend/port/win32/signal.c
> and src/port/kill.c, I have to wonder why we have this baroque-looking
> design of using *two* signal management threads. And, if I'm
> reading it right, we create an entire new pipe object and an entire
> new instance of the second thread for each incoming signal. Plus, the
> signal senders use CallNamedPipe (hence, underneath, TransactNamedPipe)
> which means they in effect wait for the recipient's signal-handling
> thread to ack receipt of the signal. Maybe there's a good reason for
> all this but it sure seems like a lot of wasted cycles from here.
>
> I have to wonder why we don't have a single named pipe that lasts as
> long as the recipient process does, and a signal sender just writes
> one byte to it, and considers the signal delivered if it is able to
> do that. The "message" semantics seem like overkill for that.
>
> I dug around in the contemporaneous archives and could only find
> https://www.postgresql.org/message-id/303E00EBDD07B943924382E153890E5434AA47%40cuthbert.rcsinc.local
> which describes the existing approach but fails to explain why we
> should do it like that.
>
> This might or might not have much to do with the immediate problem,
> but I can't help wondering if there's some race-condition-ish behavior
> in there that's contributing to what we're seeing.
>

On the receiving side, the work we do after the 'notify' is finished
(or before CallNamedPipe gets control back) is as follows:

pg_signal_dispatch_thread()
{
    ..
    FlushFileBuffers(pipe);
    DisconnectNamedPipe(pipe);
    CloseHandle(pipe);

    pg_queue_signal(sigNum);
}

Most of these are system calls, which makes me think that they might be
slow enough on some Windows versions to lead to such a race condition.

Now, coming back to the other theory, that the scheduler is not able to
schedule these signal management threads: I think if that were the
case, the notify could not have finished, because CallNamedPipe
returns only once the dispatch thread has written back to the pipe.
If the scheduler somehow kicks this thread out right after it writes
back to the pipe, it is possible that we see such behavior; however, I
am not sure we can do anything about that.
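
To make the window concrete, here is a minimal sketch of the ordering
described above. It is a POSIX analogue, not the Windows code: plain
pipes and a pthread stand in for the named pipe and the dispatch
thread, and every name in it (demo_channel, demo_dispatch_thread,
demo_send_signal, the "gate" pipe) is made up for illustration. The
gate is an assumption that stands in for the dispatch thread being
slow or descheduled between the write-back and pg_queue_signal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for the sender/recipient plumbing. */
typedef struct
{
    int         request[2];     /* sender -> dispatcher: signal number */
    int         reply[2];       /* dispatcher -> sender: acknowledgement */
    int         gate[2];        /* holds the dispatcher after the ack */
    atomic_int  queued_signal;  /* 0 means nothing has been queued yet */
} demo_channel;

static void *
demo_dispatch_thread(void *arg)
{
    demo_channel *ch = arg;
    unsigned char signum, go;

    if (read(ch->request[0], &signum, 1) != 1)
        return NULL;
    /* Ack first: from here on the sender considers the signal sent. */
    if (write(ch->reply[1], &signum, 1) != 1)
        return NULL;
    /* Assumed delay: slow cleanup calls, or the thread losing the CPU. */
    if (read(ch->gate[0], &go, 1) != 1)
        return NULL;
    /* Only now does the signal become visible to the recipient. */
    atomic_store(&ch->queued_signal, signum);
    return NULL;
}

/*
 * Sends one "signal" and returns what the sender could observe in
 * queued_signal right after the ack arrived (-1 on failure).  The value
 * seen once the dispatcher has finished is stored in *final_out.
 */
int
demo_send_signal(int signum, int *final_out)
{
    demo_channel ch;
    pthread_t   tid;
    unsigned char byte = (unsigned char) signum, ack = 0, go = 1;
    int         seen_after_ack = -1;
    ssize_t     released;

    assert(final_out != NULL);
    atomic_init(&ch.queued_signal, 0);
    if (pipe(ch.request) != 0 || pipe(ch.reply) != 0 || pipe(ch.gate) != 0)
        return -1;
    if (pthread_create(&tid, NULL, demo_dispatch_thread, &ch) != 0)
        return -1;

    /* Like CallNamedPipe: write the request, then wait for the reply. */
    if (write(ch.request[1], &byte, 1) == 1 &&
        read(ch.reply[0], &ack, 1) == 1)
        seen_after_ack = atomic_load(&ch.queued_signal);

    /* Let the dispatcher proceed; closing the write ends also unblocks
     * it (via EOF) if one of the writes above failed. */
    released = write(ch.gate[1], &go, 1);
    close(ch.request[1]);
    close(ch.gate[1]);
    pthread_join(tid, NULL);
    if (released != 1)
        seen_after_ack = -1;

    *final_out = atomic_load(&ch.queued_signal);
    close(ch.request[0]);
    close(ch.reply[0]);
    close(ch.reply[1]);
    close(ch.gate[0]);
    return seen_after_ack;
}
```

Calling demo_send_signal(10, &final_value) shows the sender getting
its ack while queued_signal is still 0; the value only becomes 10 once
the dispatcher is allowed to run again. That is the window in which
the sender may believe the signal has been delivered although the
recipient has not queued it yet. The sketch says nothing about how
long that window is on a real Windows system.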

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
