Re: "could not reattach to shared memory" on buildfarm member dory

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heath Lord <heath(dot)lord(at)crunchydata(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" on buildfarm member dory
Date: 2018-05-01 00:01:40
Message-ID: 29823.1525132900@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> The solution I was thinking about last night was to have
> PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory
> segment at an unspecified address, then unmap it, then call VirtualFree,
> and finally call MapViewOfFileEx with the real target address. The idea
> here is to get these various DLLs to set up any memory allocation pools
> they're going to set up before we risk doing VirtualFree. I am not,
> at this point, convinced this will fix it :-( ... but I'm not sure what
> else to try.

So the answer is that that doesn't help at all.

It's clear from dory's results that something is causing a 4MB chunk
of memory to get reserved in the process's address space, sometimes.
It might happen during the main MapViewOfFileEx call, or during the
preceding VirtualFree, or with my map/unmap dance in place, it might
happen during that. Frequently it doesn't happen at all, at least not
before the point where we've successfully done MapViewOfFileEx. But
if it does happen, and the chunk happens to get put in a spot that
overlaps where we want to put the shmem block, kaboom.
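
To make the failure condition concrete, here is a minimal sketch of the
overlap test implied above. It is not PostgreSQL code: ranges_overlap and
its parameters are hypothetical names, and it assumes address ranges are
half-open and do not wrap around the end of the address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns true if [a_start, a_start + a_len) and [b_start, b_start + b_len)
 * share at least one byte. An empty range overlaps nothing.
 */
static bool
ranges_overlap(uintptr_t a_start, size_t a_len,
               uintptr_t b_start, size_t b_len)
{
    /* Assumption: neither range wraps past the top of the address space. */
    assert(a_start + a_len >= a_start);
    assert(b_start + b_len >= b_start);

    if (a_len == 0 || b_len == 0)
        return false;

    /* Two half-open intervals intersect iff each starts before the other ends. */
    return a_start < b_start + b_len && b_start < a_start + a_len;
}
```

With a stray reservation of 4MB (0x400000 bytes) as the first range and the
intended shmem block as the second, a true result corresponds to the case
where the final MapViewOfFileEx at the target address cannot succeed. The
addresses used to exercise it would be made up; only the 4MB size comes
from dory's results.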

What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration. Maybe the reason dory is showing the problem
is that something or other is spawning a thread we don't even know about?

I'm going to go put a 1-sec sleep into the beginning of
PGSharedMemoryReAttach and see if that changes anything. If I'm right
that this is being triggered by another thread, that should allow the
other thread to do its thing (at least most of the time) so that the
failure rate ought to go way down.

Even if that does happen, I'm at a loss for a reasonable way to fix it
for real. Is there a way to seize control of a Windows process so that
there are no other running threads? Any other ideas?

regards, tom lane
