Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-03-25 00:02:34
Message-ID: 20220325000234.uzsbjzk4evshdizl@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-03-24 15:05:40 -0400, Tom Lane wrote:
> I just noticed something very interesting: in a couple of recent
> buildfarm runs with this failure, the pg_stat_activity printout
> no longer shows the extra walsender:

Oh. That is interesting. Thanks for catching that.

> What this suggests to me is that maybe the extra walsender is
> indeed not blocked on anything, but is just taking its time
> about exiting. In these two runs, as well as in all the
> non-failing runs, it had enough time to do so.

Still odd that it started with the bugfix I committed. And the locking pattern
in [1] still seems suspicious. But the above does seem to suggest those might
just have been red herrings.

I had previously tried to make process exit slow by inserting delays at
various points, and only succeeded in making tests fail due to too many
connections. For some reason I didn't increase max_connections...

I now increased the test's max_connections and managed to get a kind of
similar failure by sticking
elog(LOG, "received disconnect");
pg_usleep(220000);
into the 'X' case in PostgresMain().

However, I always see three pids or the expected one pid, never two like in the
BF cases. But it's not too surprising that such an artificial, uniform slowdown
would have different symptoms than reality.

> I suggest that we add a couple-of-seconds sleep in front of
> the query that collects walsender PIDs, and maybe a couple more
> seconds before the pg_stat_activity probe in the failure path,
> and see if the behavior changes at all.

Makes sense. I was previously thinking that it'd make sense to sleep for a bit
before the ->stop('fast'); calls, so that we can see whether it's just the
shutdown unblocking things (which [1] seems to suggest).

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test even if a later iteration finds just one walsender, which
seems the right behaviour for now.
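For reference, the retry logic boils down to a bounded poll. A rough C sketch of
the same pattern (the actual patch is Perl TAP code; the names and the
stubbed-out check below are hypothetical, not what the diff contains):

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Stand-in for the real check, which counts walsender rows in
 * pg_stat_activity; here it just pretends the extra walsenders
 * drain away after a few polls.
 */
static int	remaining = 4;

static int
count_walsenders(void)
{
	if (remaining > 1)
		remaining--;
	return remaining;
}

/*
 * Poll for up to 15s (150 iterations x 100ms), printing debugging
 * information on every iteration that still sees more than one
 * walsender.  The actual patch additionally marks the test as
 * failed even when the count eventually settles at one.
 */
static bool
wait_for_single_walsender(void)
{
	for (int i = 0; i < 150; i++)
	{
		int			n = count_walsenders();

		if (n == 1)
			return true;
		fprintf(stderr, "attempt %d: %d walsenders\n", i, n);
		usleep(100 * 1000);		/* 100ms between retries */
	}
	return false;
}
```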

> That should be enough to confirm or disprove this idea pretty quickly. If
> it is right, a permanent fix could be to wait for the basebackup's walsender
> to disappear from node_primary3's pg_stat_activity before we start the one
> for node_standby3.

For some tests a "confirmed disconnect" mode would be useful. Basically the
client waiting until it receives EOF after sending 'X'.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220226072325.wtvkwvvga2wc3nkn%40alap3.anarazel.de

Attachment Content-Type Size
019_replslot_limit_debug.diff text/x-diff 2.2 KB
