Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

From: Noah Misch <noah(at)leadboat(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, "9erthalion6(at)gmail(dot)com" <9erthalion6(at)gmail(dot)com>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
Date: 2019-04-08 06:41:41
Message-ID: 20190408064141.GA2016666@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 04, 2019 at 07:53:19AM -0700, Noah Misch wrote:
> On Wed, Apr 03, 2019 at 07:05:43PM -0700, Noah Misch wrote:
> > Pushed, but that broke two buildfarm members:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2019-04-04%2000%3A33%3A14
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2019-04-04%2000%3A33%3A13
> >
> > I think the problem arose because these animals run on the same machine, and
> > their test execution was synchronized to the second. Two copies of the new
> > test ran concurrently. It doesn't tolerate that, owing to expectations about
> > which shared memory keys are in use. My initial thought is to fix this by
> > having a third postmaster that runs throughout the test and represents
> > ownership of a given port. If that postmaster gets something other than the
> > first shm key pertaining to its port, switch ports and try again.
> >
> > I'll also include fixes for the warnings Andres reported on the
> > pgsql-committers thread.
>
> This thread's 2019-04-03 patches still break buildfarm members in multiple
> ways. I plan to revert them. I'll wait a day or two before doing that, in
> case more failure types show up.

Notable classes of buildfarm failure:

- AIX animals failed two ways. First, I missed a "use" statement such that
poll_start() would fail if it needed more than one attempt. Second, I
assumed $pid would be gone as soon as kill(9, $pid) returned[1].
- komodoensis and idiacanthus failed due to 16ee6ea not fully resolving the
problems with concurrent execution. I reproduced the various concurrency
bugs by setting up four vpath build trees and looping the one test in each:
for dir in 0 1 2 3; do (until [ -f /tmp/stopprove ]; do make -C $dir/src/test/recovery installcheck PROVE_TESTS=t/017_shm.pl; done) & done; wait; rm /tmp/stopprove
- elver failed due to semaphore exhaustion. I'm reducing max_connections.
- lorikeet's FailedAssertion("!(vmq->mq_sender == ((void *)0))" looked
suspicious, but this happened six other times in the past year[2], always on
v10 lorikeet.
- Commit 0aa0ccf, a wrong back-patch, saw 100% failure of the new test.

While it didn't cause a buildfarm failure, I'm changing the non-test code to
treat shmat() EACCESS as SHMSTATE_FOREIGN, so we ignore that key and move to
another. In the previous version, I treated it as SHMSTATE_ANALYSIS_FAILURE
and blocked startup. In HEAD today, shmat() failure blocks startup if and
only if we got the shmid from postmaster.pid; there's no distinction between
EACCES and other causes.

Attached v4 fixes all the above. I've also attached the incremental diff
versus the code I reverted.

[1] POSIX says "sig or at least one pending unblocked signal shall be
delivered to the sending thread before kill() returns." I doubt the
postmaster had another signal pending often enough to explain the failures, so
AIX probably doesn't follow POSIX in this respect.

[2] All examples in the last year:
sysname │ stage │ branch │ snapshot │ url
──────────┼────────────────┼───────────────┼─────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────
lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-04 09:49:55 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-04%2009:49:55
lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-05 13:15:24 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-05%2013:15:24
lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-06 09:33:35 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-06%2009:33:35
lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-15 20:52:36 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-15%2020:52:36
lorikeet │ Check │ REL_10_STABLE │ 2019-02-20 10:40:40 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-02-20%2010:40:40
lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2019-03-06 09:31:24 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-03-06%2009:31:24
lorikeet │ Check │ REL_10_STABLE │ 2019-04-04 09:47:02 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-04-04%2009:47:02

Attachment Content-Type Size
PGSharedMemoryCreate-safety-v4.patch text/plain 38.5 KB
PGSharedMemoryCreate-safety-v4-incr.patch text/plain 6.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-04-08 06:43:36 Re: Re: A separate table level option to control compression
Previous Message Amit Langote 2019-04-08 06:38:51 Re: ToDo: show size of partitioned table