Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-10-07 04:07:48
Message-ID: 18537.1570421268@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>> that would fork the postmaster, wait for the postmaster to exit, and then
>> report the exit status. Where to report it *to* seems like the hard part,
>> but maybe an answer that worked for the buildfarm would be enough for now.

> Oh, right, you don't even need subreaper tricks (I was imagining we
> had a double fork somewhere we don't).

I got around to looking at how to do this. Seeing that chipmunk hasn't
failed again, I'm inclined to write that off as perhaps unrelated.
That leaves us to diagnose the pg_upgrade failures on wobbegong and
vulpes. The pg_upgrade test uses pg_ctl to start the postmaster,
and the only simple way to wedge this requirement into pg_ctl is as
attached. Now, the attached is completely *not* suitable as a permanent
patch, because it degrades or breaks a number of pg_ctl behaviors that
rely on knowing the postmaster's actual PID rather than that of the
parent shell. But it gets through check-world, so I think we can stick it
in transiently to see what it can teach us about the buildfarm failures.
Given wobbegong's recent failure rate, I don't think we'll have to wait
long.

Some notes about the patch:

* The core idea is to change start_postmaster's shell invocation
so that the shell doesn't just exec the postmaster, but runs a
mini shell script that runs the postmaster and then reports its
exit status. I found that this still needed a dummy exec to force
the shell to perform the I/O redirections on itself, else pg_ctl's
TAP tests fail. (I think what was happening was that if the shell
continued to hold open its original stdin, IPC::Run didn't believe
the command was done.)

* This means that what start_postmaster returns is not the postmaster's
own PID, but that of the parent shell. So we have to lobotomize
wait_for_postmaster to handle the PID the same way as on Windows
(where that was already true); it can't test for exact equality
between the child process PID and what's in postmaster.pid.
(trap_sigint_during_startup is also broken, but we don't need that
to work to get through the regression tests.)

* That makes recovery/t/017_shm.pl fail, because there's a race
condition: after killing the old postmaster, the existing
postmaster.pid is enough to fool "pg_ctl start" into thinking the new
postmaster is already running. I fixed that by making pg_ctl reject
any PID seen in a pre-existing postmaster.pid file. That has a
nonzero probability of false match, so I would not want to stick it
in as a permanent thing on Unix ... but I wonder if it wouldn't be
an improvement over the current situation on Windows.

regards, tom lane

Attachment Content-Type Size
hack-report-postmaster-exit-status-from-pg_ctl.patch text/x-diff 2.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2019-10-07 04:08:41 Re: Updated some links which are not working with new links
Previous Message Masahiko Sawada 2019-10-07 04:04:56 Re: [HACKERS] Block level parallel vacuum