Re: Unexpected "shared memory block is still in use"

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Unexpected "shared memory block is still in use"
Date: 2019-05-10 07:22:13
Message-ID: 20190510072213.GA1098093@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 09, 2019 at 06:47:58PM -0400, Tom Lane wrote:
> I wrote:
> > However, I have a new theory, after noticing that c09850992 moved the
> > check for shm_nattch == 0. Previously, if a shmem segment had zero attach
> > count, it was unconditionally considered not-a-threat. Now, we'll try
> > shmat() anyway, and if that fails for any reason other than EACCES, we say
> > SHMSTATE_ANALYSIS_FAILURE which leads to the described error report.
> > So I suspect that what we hit was a race condition whereby some other
> > parallel test was using the same shmem ID and we managed to see its
> > segment successfully in shmctl but then it was gone by the time we did
> > shmat. This leads me to think that EINVAL and EIDRM failures from
> > shmat had better be considered SHMSTATE_ENOENT not
> > SHMSTATE_ANALYSIS_FAILURE.
> > In principle this is a longstanding race condition, but I wonder
> > whether we made it more probable by moving the shm_nattch check.
>
> Hah --- this is a real race condition, and I can demonstrate it very
> easily by inserting a sleep right there, as in the attached
> for-testing-only patch.
>
> The particular parallelism level I use is
>
> make -s check-world -j4 PROVE_FLAGS='-j4 --quiet --nocolor --nocount'
>
> on a dual-socket 4-cores-per-socket Xeon machine. With that command and
> this patch, I frequently get multiple failures per run, and they all
> report either EINVAL or EIDRM.
>
> The patch generally reports that nattch had been 1, so my thought that
> that change might've made it worse seems unfounded. But we have
> absolutely got a hittable race condition here. The real fix should
> be on the order of
>
> if (errno == EACCES)
> return SHMSTATE_FOREIGN;
> + else if (errno == EINVAL || errno == EIDRM)
> + return SHMSTATE_ENOENT;
> else
> return SHMSTATE_ANALYSIS_FAILURE;
>
> (plus comments of course).

Looks good. That is basically a defect in commit c09850992; the race passed
from irrelevance to relevance when that commit subjected more segments to the
test. Thanks for diagnosing it.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2019-05-10 07:47:03 Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
Previous Message Thomas Munro 2019-05-10 06:18:16 Re: POC: Cleaning up orphaned files using undo logs