Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-05-30 19:01:55
Message-ID: 20220530190155.47wr3x2prdwyciah@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > I still feel like there's something off here. But that's probably not enough
> > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > longer, but not fail the test anymore?
> >
> > WFM.
>
> I've done so now.

I did look over the test results a couple times since then and once more
today. There were a few cases with pretty significant numbers of iterations:

The highest is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
showing:
# multiple walsenders active in iteration 19

It's somewhat interesting that the worst case was just around the feature
freeze, where the load on my buildfarm animal boxes was higher than normal.

In comparison to earlier approaches, with the current in-tree approach, we
don't do anything when hitting the "problem", other than wait. Which does give
us additional information - afaics there's nothing at all indicating that some
other backend existed allowing the replication slot drop to finish.

It just looks like for reasons I still do not understand, removing a directory
and 2 files or so takes multiple seconds (at least ~36 new connections, 18
pg_usleep(100_100)), while there are no other indications of problems.

I also still don't have a theory why this suddenly started to happen.

Unless somebody has another idea, I'm planning to remove all the debugging
code added, but keep the retry-based approach in 019_replslot_limit.pl, so we
don't again get all the spurious failures.
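
To illustrate what "retry-based" means here, a minimal sketch in Python
follows. It is not the Perl code in 019_replslot_limit.pl; the name
wait_until, its parameters, and the 0.1 s default delay are assumptions
made for this example. The idea is only to poll a condition (e.g. "just
one walsender is active"), sleep briefly between attempts, and report how
many retries were needed instead of failing on the first miss.

```python
import time


def wait_until(condition, max_iterations=100, delay_s=0.1, sleep=time.sleep):
    """Poll `condition` until it returns true.

    Returns the number of retries that were needed (0 if the condition
    held on the first check). Raises TimeoutError if the condition is
    still false after `max_iterations` checks.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    for iteration in range(max_iterations):
        if condition():
            return iteration
        # Nothing else is done on a miss: just wait and check again.
        sleep(delay_s)
    raise TimeoutError(
        f"condition still false after {max_iterations} iterations"
    )
```

A caller would pass a function that checks the state in question and could
log the returned retry count, which is roughly the kind of number the
"iteration 19" message above reports.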

Greetings,

Andres Freund
