Re: BF animal malleefowl reported an failure in 001_password.pl

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BF animal malleefowl reported an failure in 001_password.pl
Date: 2023-01-14 07:55:37
Message-ID: 934208.1673682937@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com> writes:
> I noticed one BF failure[1] when monitoring the BF for some other commit.
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=malleefowl&dt=2023-01-13%2009%3A54%3A51
> ...
> So it seems the connection happens before pg_ident.conf is actually reloaded ?
> Not sure if we need to do something make sure the reload happen, because it's
> looks like very rare failure which hasn't happen in last 90 days.

That does look like a race condition between config reloading and
new-backend launching. However, I can't help being suspicious about
the fact that we haven't seen this symptom before and now here it is
barely a day after 7389aad63 (Use WaitEventSet API for postmaster's
event loop). It seems fairly plausible that that did something that
causes the postmaster to preferentially process connection-accept ahead
of SIGHUP. I took a quick look through the code and did not see a
smoking gun, but I'm way too tired to be sure I didn't miss something.

In general, use of WaitEventSet instead of signals will tend to slot
the postmaster into non-temporally-ordered event responses in two
ways: (1) the latch.c code will report events happening at more-or-less
the same time in a specific order, and (2) the postmaster.c code will
react to signal-handler-set flags in a specific order. AFAICS, both
of those code layers will prioritize latch events ahead of
connection-accept events, but did I misread it?

Also it seems like the various platform-specific code paths in latch.c
could diverge as to the priority order of events, which could cause
annoying platform-specific behavior. Not sure there's much to be
done there other than to be sensitive to not letting such divergence
happen.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-01-14 08:48:52 Re: Improve WALRead() to suck data directly from WAL buffers when possible
Previous Message vignesh C 2023-01-14 07:26:19 Re: fixing CREATEROLE