Re: Strange failure in LWLock on skink in REL9_5_STABLE

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Strange failure in LWLock on skink in REL9_5_STABLE
Date:	2018-09-21 03:21:38
Message-ID:	20180921032138.c45ysq6kcfwjajhg@alap3.anarazel.de
Lists:	pgsql-hackers

On 2018-09-20 22:59:29 -0400, Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> > Andres pinged me off-list to point out this failure after my commit fb389498be:
>
> > ! FATAL: semop(id=332464133) failed: Invalid argument
>
> I was just looking at that, and my guess is that it was caused by
> something doing an ipcrm or equivalent, and is unrelated to your patch.
> Especially since skink has succeeded with that patch in several other
> branches.

I'm (hopefully) the only person with access to that machine, and I
certainly didn't do so. Nor are there any scripts I know of that'd do
so. There's not been a lot of instability on skink, so it's certainly
quite weird.

I'm quite suspicious of the logic around:

    /*
     * If we received a query cancel or termination signal, we will have
     * EINTR set here. If the caller said that errors are OK here, check
     * for interrupts immediately.
     */
    if (errno == EINTR && elevel >= ERROR)
        CHECK_FOR_INTERRUPTS();

because it seems far from guaranteed to do anything meaningful: I
don't see a guarantee that interrupts aren't held off at that point (e.g.
it seems quite reasonable to hold an lwlock while resizing).
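
To illustrate the concern, here is a minimal, self-contained C model. All
names in it (hold_interrupts, check_for_interrupts,
simulate_eintr_while_locked, and the counters) are hypothetical stand-ins,
not PostgreSQL's actual definitions. The sketch only assumes that an
interrupt check becomes a no-op while a holdoff counter is nonzero, and that
taking a lock raises that counter.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the backend's interrupt state (assumption). */
static volatile int  holdoff_count = 0;        /* > 0 while e.g. a lock is held */
static volatile bool interrupt_pending = false;
static int           interrupts_serviced = 0;

static void hold_interrupts(void)   { holdoff_count++; }
static void resume_interrupts(void) { holdoff_count--; }

/* Returns true only if a pending interrupt was actually processed. */
static bool check_for_interrupts(void)
{
    if (!interrupt_pending)
        return false;
    if (holdoff_count > 0)
        return false;           /* held off: flag stays set, nothing happens */
    interrupt_pending = false;
    interrupts_serviced++;
    return true;
}

/*
 * Model a system call returning EINTR while the caller holds a lock:
 * the check runs, but cannot act on the pending interrupt.
 */
static bool simulate_eintr_while_locked(void)
{
    bool serviced;

    hold_interrupts();          /* as acquiring a lock is assumed to do */
    interrupt_pending = true;   /* a cancel signal arrives, syscall gets EINTR */
    serviced = check_for_interrupts();
    resume_interrupts();
    return serviced;
}
```

In this model simulate_eintr_while_locked() returns false and the pending
flag stays set; the interrupt is only serviced by a later check made after
the lock has been released.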

Afaict that might cause problems at a later stage, because at that point
we've not adjusted the actual mapping, but *have* ftruncate()ed it. If
there's actual data in the mapping, that certainly could cause trouble.

In fact, while this commit has expanded the size of the problem, I fail
to see how the error handling for resizing is correct. It's fine to fail
in the ftruncate() itself (at that point no changes have been made),
but I don't think it's currently ok for posix_fallocate() to ever error
out.
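
One possible shape for error handling that keeps the file size consistent is
sketched below. It is not the code from the commit under discussion and not a
proposed patch: resize_with_rollback is a hypothetical helper, and it only
covers the file size, not the remapping step. It assumes a POSIX system that
provides posix_fallocate(), which returns an error number directly rather
than setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: set fd's size to new_size and, when growing,
 * reserve the new space. If reservation fails, restore the old size.
 * Returns 0 on success, otherwise an errno-style error number.
 */
static int resize_with_rollback(int fd, off_t new_size)
{
    struct stat st;
    off_t old_size;

    if (fstat(fd, &st) != 0)
        return errno;
    old_size = st.st_size;

    /* Failing here is harmless: nothing has been changed yet. */
    if (ftruncate(fd, new_size) != 0)
        return errno;

    if (new_size > old_size) {
        int rc;

        /* posix_fallocate() reports errors via its return value. */
        do {
            rc = posix_fallocate(fd, old_size, new_size - old_size);
        } while (rc == EINTR);

        if (rc != 0) {
            /* Best-effort rollback so size and mapping still agree. */
            (void) ftruncate(fd, old_size);
            return rc;
        }
    }
    return 0;
}
```

Because the rollback only happens on the growth path, truncating back to the
old size discards nothing that existed before the call. A caller would remap
only after this function returns 0.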

It's not clear to me how that'd be problematic in 9.5 of all releases,
however.

> If it's repeatable, then it would be time to get excited.

Yea, I guess we'll have to wait :/.

Greetings,

Andres Freund

In response to

Re: Strange failure in LWLock on skink in REL9_5_STABLE at 2018-09-21 02:59:29 from Tom Lane

Responses

Re: Strange failure in LWLock on skink in REL9_5_STABLE at 2018-09-21 04:03:18 from Thomas Munro