Re: Backends stunk in wait event IPC/MessageQueueInternal

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Japin Li <japinli(at)hotmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Backends stunk in wait event IPC/MessageQueueInternal
Date: 2022-05-13 21:25:08
Message-ID: CA+hUKGKuV-TSSRVMjRhV4GuSktxj3-HuA6S+H1JQku7anFY5gw@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 14, 2022 at 2:09 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, May 13, 2022 at 6:16 AM Japin Li <japinli(at)hotmail(dot)com> wrote:
> > The process cannot be terminated by pg_terminate_backend(), although
> > it returns true.

> One thing I find a bit curious is that the top of the stack in your
> case is ioctl(). And there are no calls to ioctl() anywhere in
> latch.c, nor have there ever been. What operating system is this? We
> have 4 different versions of WaitEventSetWaitBlock() that call
> epoll_wait(), kevent(), poll(), and WaitForMultipleObjects()
> respectively. I wonder which of those we're using, and whether one of
> those calls is showing up as ioctl() in the stacktrace, or whether
> there's some other function being called in here that is somehow
> resulting in ioctl() getting called.

I guess this is really illumos (née OpenSolaris), not Solaris, using
our epoll build mode, with illumos's emulation of epoll, which maps
epoll onto Sun's /dev/poll driver:

https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/sys/epoll.c#L230

That'd explain:

fffffb7fef216f4a ioctl (d, d001, fffffb7fffdfa0e0)

The second argument, d001, matches the value of DP_POLL from:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/devpoll.h#L44
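
To spell out the arithmetic, here is a small C sketch. The two macros are
copied in by hand on the assumption that they mirror the definitions in
illumos's <sys/devpoll.h>, and is_dp_poll() is a hypothetical helper
invented for this illustration:

```c
#include <stdio.h>

/*
 * Assumption: these mirror the definitions in illumos's <sys/devpoll.h>.
 * They are repeated here so the example builds on any platform.
 */
#define DPIOC   (0xD0 << 8)
#define DP_POLL (DPIOC | 1)

/*
 * Hypothetical helper: returns 1 if an ioctl request number taken from a
 * stack trace equals DP_POLL, 0 otherwise.
 */
static int
is_dp_poll(unsigned long request)
{
    return request == (unsigned long) DP_POLL;
}

/* Print the computed request number next to the one from the trace. */
static void
print_dp_poll_check(void)
{
    printf("DP_POLL = 0x%x, trace value 0xd001 %s\n",
           (unsigned) DP_POLL,
           is_dp_poll(0xd001) ? "matches" : "does not match");
}
```

Calling print_dp_poll_check() from main() shows whether the second ioctl
argument in the trace is the same number as DP_POLL.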

Or if it's really Solaris, huh, are people moving illumos code back
into closed Solaris these days?

As for why it's hanging, I don't know, but one thing that we changed
in 14 was that we started using signalfd() to receive latch signals on
systems that have it, and illumos also has an emulation of signalfd()
that our configure script finds:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/signalfd.c

There were in fact a couple of unexplained hangs on the illumos build
farm animals, and then they were changed to use -DWAIT_USE_POLL so
that they wouldn't automatically choose epoll()/signalfd(). That is
not very satisfactory, but as far as I know there is a bug in either
epoll() or signalfd(), or at least some difference compared to the
Linux implementation they are emulating. I spent quite a bit of time
ping-ponging emails back and forth with the owner of a hanging BF
animal trying to get a minimal repro for a bug report, without
success. I mean, it's possible that the bug is in PostgreSQL (though
no complaint has ever reached me about this stuff on Linux), but while
trying to investigate it a kernel panic happened[1], which I think
counts as a point against that theory...
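
For illustration only, a minimal sketch of how a compile-time choice such
as -DWAIT_USE_POLL can take precedence over auto-detection. The HAVE_*
macros stand in for configure results and are assumptions, as are the enum
and function names; the real logic in latch.c differs in detail:

```c
/* Simulates passing -DWAIT_USE_POLL on the compiler command line. */
#ifndef WAIT_USE_POLL
#define WAIT_USE_POLL
#endif

/*
 * Pick a wait primitive.  A manual choice wins; otherwise fall back to
 * whatever the (hypothetical) configure probes found.
 */
#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \
    defined(WAIT_USE_KQUEUE)
/* a manual choice was made; do not overwrite it */
#elif defined(HAVE_SYS_EPOLL_H)
#define WAIT_USE_EPOLL
#elif defined(HAVE_KQUEUE)
#define WAIT_USE_KQUEUE
#else
#define WAIT_USE_POLL
#endif

/* Hypothetical enum, used only to report the outcome of the chain above. */
enum wait_method
{
    WAIT_METHOD_EPOLL,
    WAIT_METHOD_KQUEUE,
    WAIT_METHOD_POLL
};

static enum wait_method
chosen_wait_method(void)
{
#if defined(WAIT_USE_EPOLL)
    return WAIT_METHOD_EPOLL;
#elif defined(WAIT_USE_KQUEUE)
    return WAIT_METHOD_KQUEUE;
#else
    return WAIT_METHOD_POLL;
#endif
}
```

Because WAIT_USE_POLL is already defined when the chain is evaluated, the
epoll branch is never taken, even on a system where the epoll header exists.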

(For what it's worth, WSL1 also emulates these two Linux interfaces,
and apparently not well enough for our purposes either, for reasons we
don't understand.)

In short, I'd recommend -DWAIT_USE_POLL for now. It's possible that
we could do something to prevent the selection of WAIT_USE_EPOLL on
that platform, or that we should have a halfway option of epoll()
without signalfd() (= go back to using the self-pipe trick), patches
welcome, but that feels kinda strange and would be a very niche
combination that isn't fun to maintain... the real solution is to fix
the bug.
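
For readers who haven't met the self-pipe trick: a minimal standalone
sketch, assuming a POSIX system. This is not PostgreSQL's code, and all
the names in it are hypothetical. The signal handler writes one byte to a
non-blocking pipe, and the main loop waits for that pipe to become
readable with poll():

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names for this sketch; not PostgreSQL's identifiers. */
static int wakeup_read_fd = -1;
static int wakeup_write_fd = -1;

/* Signal handler: only async-signal-safe work, i.e. write one byte. */
static void
wakeup_handler(int signo)
{
    int  save_errno = errno;
    char byte = 0;

    (void) signo;
    /* A full pipe (EAGAIN) is fine: a wakeup is already pending. */
    if (write(wakeup_write_fd, &byte, 1) < 0)
    {
        /* nothing useful can be done inside a signal handler */
    }
    errno = save_errno;
}

static int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Set up the pipe and handler, send ourselves SIGUSR1, then wait in poll().
 * Returns 1 if the signal woke us, 0 on timeout, -1 on error.
 */
static int
self_pipe_demo(int timeout_ms)
{
    int              fds[2];
    struct sigaction sa;
    struct pollfd    pfd;
    char             buf[16];
    int              result = -1;
    int              rc;

    if (pipe(fds) < 0)
        return -1;
    wakeup_read_fd = fds[0];
    wakeup_write_fd = fds[1];

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = wakeup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;

    if (set_nonblocking(fds[0]) == 0 && set_nonblocking(fds[1]) == 0 &&
        sigaction(SIGUSR1, &sa, NULL) == 0)
    {
        raise(SIGUSR1);         /* the handler writes the wakeup byte */

        pfd.fd = wakeup_read_fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        do
        {
            rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);

        if (rc > 0)
        {
            /* Drain the pipe so the next wait starts from a clean state. */
            while (read(wakeup_read_fd, buf, sizeof(buf)) > 0)
                ;
            result = 1;
        }
        else
            result = rc;        /* 0 = timeout, -1 = error */

        /* Restore the default disposition before closing the pipe. */
        sa.sa_handler = SIG_DFL;
        sigaction(SIGUSR1, &sa, NULL);
    }

    close(fds[0]);
    close(fds[1]);
    wakeup_read_fd = wakeup_write_fd = -1;
    return result;
}
```

Calling self_pipe_demo(1000) from main() should report a wakeup. With
signalfd() the signal instead arrives as data on a file descriptor, so no
handler or pipe is involved.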

[1] https://www.illumos.org/issues/13700
