Re: "could not reattach to shared memory" on buildfarm member dory

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heath Lord <heath(dot)lord(at)crunchydata(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, joseph(dot)ayers(at)crunchydata(dot)com
Subject: Re: "could not reattach to shared memory" on buildfarm member dory
Date: 2019-04-02 13:54:42
Message-ID: 20190402135442.GA1173872@rfd.leadboat.com
Lists: pgsql-hackers

On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> > On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > > Overall, I agree that neither of these approaches are exactly attractive.
> > > We're paying a heck of a lot of performance or complexity to solve a
> > > problem that shouldn't even be there, and that we don't understand well.
> > > In particular, the theory that some privileged code is injecting a thread
> > > into every new process doesn't square with my results at
> > > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> > >
> > > I think our best course of action at this point is to do nothing until
> > > we have a clearer understanding of what's actually happening on dory.
> > > Perhaps such understanding will yield an idea for a less painful fix.
>
> Could one of you having a dory login use
> https://live.sysinternals.com/Procmon.exe to capture process events during
> backend startup? The ideal would be one capture where startup failed reattach
> and another where it succeeded, but having the successful run alone would be a
> good start.

Joseph Ayers provided, off-list, the capture from a successful startup. It
wasn't materially different from the one my system generates, so I abandoned
that line of inquiry. Having explored other aspects of the problem, I expect
the attached fix will work. I can reproduce the 4 MiB allocations described
in https://postgr.es/m/29823.1525132900@sss.pgh.pa.us; a few times per
"vcregress check", they emerge in the middle of PGSharedMemoryReAttach(). On
my system, there's 5.7 MiB of free address space just before UsedShmemSegAddr,
so the 4 MiB allocation fits in there, and PGSharedMemoryReAttach() does not
fail. Still, it's easy to imagine that boring variations between environments
could surface dory's problem by reducing that free 5.7 MiB to, say, 3.9 MiB.
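
For reference, one can measure that free space with a short VirtualQuery()
walk below the address in question.  The sketch below is only an illustrative
probe, not code from the patch; free_space_below() and the stand-in target in
main() are made-up names:

/*
 * How much contiguous free address space lies immediately below "target"?
 * In a backend, "target" would be UsedShmemSegAddr.
 */
#include <windows.h>
#include <stdio.h>

static SIZE_T
free_space_below(void *target)
{
	MEMORY_BASIC_INFORMATION mbi;
	char	   *probe = (char *) target;
	SIZE_T		total = 0;

	/* Walk downward, region by region, until we reach one that is in use. */
	while (probe != NULL &&
		   VirtualQuery(probe - 1, &mbi, sizeof(mbi)) != 0 &&
		   mbi.State == MEM_FREE)
	{
		total += probe - (char *) mbi.BaseAddress;
		probe = (char *) mbi.BaseAddress;
	}
	return total;
}

int
main(void)
{
	/* Stand-in address for demonstration only. */
	void	   *target = VirtualAlloc(NULL, 65536, MEM_RESERVE, PAGE_NOACCESS);

	printf("%.1f MiB free below %p\n",
		   free_space_below(target) / (double) (1024 * 1024), target);
	return 0;
}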

The 4 MiB allocations are stacks for new threads of the default thread
pool[1]. (I confirmed that by observing their size change when I changed
StackReserveSize in MSBuildProject.pm and by checking all stack pointers with
"thread apply all info frame" in gdb.) The API calls in
PGSharedMemoryReAttach() don't cause the thread creation; it's a timing
coincidence. Commit 2307868 would have worked around the problem, but
pg_usleep() is essentially a no-op on Windows before
pgwin32_signal_initialize() runs. (I'll push Assert(pgwin32_signal_event) to
some functions.) While one fix is to block until all expected threads have
started, that could be notably slow, and I don't know how to implement it
cleanly. I think a better fix is to arrange for the system to prefer a
different address space region for these thread stacks; for details, see the
first comment the patch adds to win32_shmem.c. This works here.
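
To make that concrete, here is a rough, hypothetical sketch of one way such a
preference could be arranged: when first choosing the segment's address, leave
a free band wider than a default-pool thread stack immediately below it, so a
stack allocated during a child's release-and-remap window lands in the band
rather than at UsedShmemSegAddr.  PROTECTIVE_BAND and choose_segment_address()
are invented names, and the comment the patch adds to win32_shmem.c, not this
sketch, remains the authoritative description:

#include <windows.h>
#include <stdio.h>

#define PROTECTIVE_BAND (8 * 1024 * 1024)	/* comfortably > one 4 MiB stack */

static void *
choose_segment_address(HANDLE hmap, SIZE_T segsize)
{
	char	   *probe;

	/* Find a span large enough for the band plus the segment ... */
	probe = (char *) VirtualAlloc(NULL, PROTECTIVE_BAND + segsize,
								  MEM_RESERVE, PAGE_NOACCESS);
	if (probe == NULL)
		return NULL;

	/* ... give it back, so the band below the segment stays free ... */
	VirtualFree(probe, 0, MEM_RELEASE);

	/*
	 * ... and map the segment above the band.  A real implementation would
	 * have to cope with something else grabbing this span between the
	 * release and the map, e.g. by retrying.
	 */
	return MapViewOfFileEx(hmap, FILE_MAP_READ | FILE_MAP_WRITE,
						   0, 0, 0, probe + PROTECTIVE_BAND);
}

int
main(void)
{
	SIZE_T		segsize = 16 * 1024 * 1024;
	HANDLE		hmap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
										 PAGE_READWRITE, 0, (DWORD) segsize, NULL);

	if (hmap != NULL)
		printf("segment at %p, free band below it\n",
			   choose_segment_address(hmap, segsize));
	return 0;
}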

> backend startup sees six thread creations:
>
> 1. main thread
> 2. thread created before postgres.exe has control
> 3. thread created before postgres.exe has control
> 4. thread created before postgres.exe has control
> 5. in pgwin32_signal_initialize()
> 6. in src\backend\port\win32\timer.c:setitimer()
>
> Threads 2-4 exit exactly 30s after creation. If we fail to reattach to shared
> memory, we'll exit before reaching code to start 5 or 6.

Threads 2-4 proved to be worker threads of the default thread pool.

[1] https://docs.microsoft.com/en-us/windows/desktop/ProcThread/thread-pools

Attachment: shmem-protective-region-v1.patch (text/plain, 8.2 KB)
