Re: Windows buildfarm members vs. new async-notify isolation test

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Windows buildfarm members vs. new async-notify isolation test
Date: 2019-12-07 19:56:26
Message-ID: 4412.1575748586@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>> On Sat, Dec 7, 2019 at 5:01 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> A possible theory as to what's happening is that the kernel scheduler
>>> is discriminating against listener2's signal management thread(s)
>>> and not running them until everything else goes idle for a moment.

>> If we are to believe that theory, then why is the other similar test
>> not showing the problem?

> There are fewer processes involved in that case, so I don't think
> it disproves the theory that this is a scheduler glitch.

So, just idly looking at the code in src/backend/port/win32/signal.c
and src/port/kill.c, I have to wonder why we have this baroque-looking
design of using *two* signal management threads. And, if I'm
reading it right, we create an entire new pipe object and an entire
new instance of the second thread for each incoming signal. Plus, the
signal senders use CallNamedPipe (hence, underneath, TransactNamedPipe)
which means they in effect wait for the recipient's signal-handling
thread to ack receipt of the signal. Maybe there's a good reason for
all this but it sure seems like a lot of wasted cycles from here.

I have to wonder why we don't have a single named pipe that lasts as
long as the recipient process does, and a signal sender just writes
one byte to it, and considers the signal delivered if it is able to
do that. The "message" semantics seem like overkill for that.
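To make the proposal concrete, here is a minimal sketch of the "one
long-lived pipe, one byte per signal" idea. It is not the Win32 code: it
uses an ordinary POSIX pipe as a stand-in for a named pipe, and every
name in it (SignalPipe, send, drain) is hypothetical. The point is only
that the sender treats a successful one-byte write as delivery and never
waits for an acknowledgment.

```python
import os


class SignalPipe:
    """Hypothetical stand-in for a per-process signal pipe.

    The pipe is created once and kept for the life of the recipient,
    rather than being re-created for each incoming signal.
    """

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        # The recipient polls without blocking, so an empty pipe is not an error.
        os.set_blocking(self.read_fd, False)

    def send(self, signum):
        """Sender side: write one byte holding the signal number (0..255).

        The signal counts as delivered if the write succeeds; there is
        no round trip and no ack from the recipient.
        """
        try:
            return os.write(self.write_fd, bytes([signum])) == 1
        except OSError:
            return False

    def drain(self):
        """Recipient side: return all signal numbers queued so far, in order."""
        pending = []
        while True:
            try:
                chunk = os.read(self.read_fd, 64)
            except BlockingIOError:
                break  # nothing more queued right now
            if not chunk:
                break  # write end closed
            pending.extend(chunk)
        return pending

    def close(self):
        os.close(self.read_fd)
        os.close(self.write_fd)


if __name__ == "__main__":
    pipe = SignalPipe()
    pipe.send(10)
    pipe.send(12)
    print(pipe.drain())
    pipe.close()
```

A real Windows version would presumably keep one named-pipe instance
open and have senders write to it instead of calling CallNamedPipe, but
how that interacts with multiple concurrent senders is exactly the part
this sketch does not address.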

I dug around in the contemporaneous archives and could only find
https://www.postgresql.org/message-id/303E00EBDD07B943924382E153890E5434AA47%40cuthbert.rcsinc.local
which describes the existing approach but fails to explain why we
should do it like that.

This might or might not have much to do with the immediate problem,
but I can't help wondering if there's some race-condition-ish behavior
in there that's contributing to what we're seeing. We already had to
fix a couple of race conditions from doing it like this, cf. commits
2e371183e, 04a4413c2, f27a4696f. Perhaps 0ea1f2a3a is relevant
as well.

regards, tom lane
