Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Date: 2015-09-06 00:53:58
Message-ID: 20150906005358.GA3060805@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 04, 2015 at 10:54:51PM -0400, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
> > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> >>> This is the first time I've seen an indication that the
> >>> start_postmaster change mentioned in the comment is actually important
> >>> for production use, rather than just being cleanup.
>
> > I scratched my head awhile without thinking of a credible production use case
> > that would notice the difference. Which one did you have in mind?
>
> Well, mainly that it's making our regression tests fail, which suggests
> behavioral instability that could be important for production.

I doubt issuing two "pg_ctl start" within two seconds, with no intervening
stop, has a production use. However, ...

> Aside from the specific case you diagnosed, it's clear from buildfarm
> results that the five-second timeout elsewhere in
> test_postmaster_connection has got funny behavior under load; there are
> way too many cases where pg_ctl gives up after exactly that long, with
> no useful information printed. The draft patch I posted gets rid of that
> behavior ...

... fixing this is a good deal more valuable. Besides coping with stalls in
early postmaster startup, pg_ctl would no longer take 5s to exit when the
postmaster reports a postgresql.conf defect or other early error.

On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.

That seems prudent. We'll keep meeting buildfarm members with performance
variability sufficient to reach that 5s timeout. (Incidentally, the current
coding first appeared in 9.1.)

> --- 650,671 ----
> }
>
> /*
> ! * Reap child process if it died; if it remains in zombie state then
> ! * our kill-based test will think it's still running.
> */
> ! #ifndef WIN32
> {
> ! int exitstatus;
>
> ! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);

Let's report any exit status. When a signal kills the young postmaster, it's
a waste to omit that detail.

> }
> + #endif
>
> /*
> ! * Check whether the child postmaster process is still alive. This
> ! * lets us exit early if the postmaster fails during startup.
> */
> ! if (!postmaster_is_alive((pid_t) pm_pid))
> return PQPING_NO_RESPONSE;

kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough.

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2015-09-06 00:58:54 Re: Failing assertions in indxpath.c, placeholder.c and brin_minmax.c
Previous Message Peter Geoghegan 2015-09-06 00:48:10 Re: Using quicksort for every external sort run