Re: Race conditions in 019_replslot_limit.pl

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, sawada(dot)mshk(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-05-31 01:31:07
Message-ID: 20220531.103107.1261637053934370702.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> Hi,
>
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > > I still feel like there's something off here. But that's probably not enough
> > > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > > longer, but not fail the test anymore?
> > >
> > > WFM.
> >
> > I've done so now.
>
> I did look over the test results a couple times since then and once more
> today. There were a few cases with pretty significant numbers of iterations:
>
> The highest is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
> showing:
> # multiple walsenders active in iteration 19
>
> It's somewhat interesting that the worst case was just around the feature
> freeze, where the load on my buildfarm animal boxes was higher than normal.

If the disk is too busy, CheckPointReplicationSlots may take a very long time.

> In comparison to earlier approaches, with the current in-tree approach, we
> don't do anything when hitting the "problem", other than wait. Which does give
> us additional information - afaics there's nothing at all indicating that some
> other backend existed allowing the replication slot drop to finish.

Did you mean "preventing"? The checkpointer and a client backend that ran
"SELECT * FROM pg_stat_activity" are the only processes running during the
blocking state.

> It just looks like for reasons I still do not understand, removing a directory
> and 2 files or so takes multiple seconds (at least ~36 new connections, 18
> pg_usleep(100_100)), while there are no other indications of problems.

That fact supports the theory that CheckPointReplicationSlots took a long time.

> I also still don't have a theory why this suddenly started to happen.

Maybe we need to look at the OS-wide disk load at that time. Couldn't the
compiler or other non-postgres tools put significant load on the disks?

> Unless somebody has another idea, I'm planning to remove all the debugging
> code added, but keep the retry based approach in 019_replslot_limit.pl, so we
> don't again get all the spurious failures.

+1.
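
For readers following the thread, the retry-based approach amounts to roughly
the sketch below. It is written in Python for illustration only and is not the
actual Perl code in 019_replslot_limit.pl. The function name, the
count_walsenders callback, the iteration limit and the delay are all
hypothetical. The idea is simply to poll until exactly one walsender is
visible instead of failing on the first check.

```python
import time
from typing import Callable


def wait_for_single_walsender(
    count_walsenders: Callable[[], int],  # hypothetical: returns the number of active walsenders
    max_iterations: int = 20,             # assumed limit, not taken from the real test
    delay_s: float = 0.1,                 # assumed delay between polls
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until exactly one walsender is active.

    Returns the zero-based iteration in which the condition first held.
    Raises TimeoutError if it never holds within max_iterations polls.
    """
    for iteration in range(max_iterations):
        if count_walsenders() == 1:
            return iteration
        # More (or fewer) than one walsender: wait and retry rather than fail.
        sleep(delay_s)
    raise TimeoutError(
        f"no single walsender seen after {max_iterations} iterations"
    )
```

In a real test the callback would run a query against pg_stat_activity; here
it is left abstract so the loop can be exercised without a server.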

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
