From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, sawada(dot)mshk(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-05-31 01:31:07
Message-ID: 20220531.103107.1261637053934370702.horikyota.ntt@gmail.com
Lists: pgsql-hackers
At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> Hi,
>
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > > I still feel like there's something off here. But that's probably not enough
> > > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > > longer, but not fail the test anymore?
> > >
> > > WFM.
> >
> > I've done so now.
>
> I did look over the test results a couple times since then and once more
> today. There were a few cases with pretty significant numbers of iterations:
>
> The highest is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
> showing:
> # multiple walsenders active in iteration 19
>
> It's somewhat interesting that the worst case was just around the feature
> freeze, where the load on my buildfarm animal boxes was higher than normal.
If disk is too busy, CheckPointReplicationSlots may take very long.
> In comparison to earlier approaches, with the current in-tree approach, we
> don't do anything when hitting the "problem", other than wait. Which does give
> us additional information - afaics there's nothing at all indicating that some
> other backend existed allowing the replication slot drop to finish.
preventing? The checkpointer and a client backend that ran "SELECT * FROM
pg_stat_activity" are the only processes running during the blocking
state.
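
(To be clear, what I mean is a probe like the following hypothetical
sketch; $node_primary and the column list are illustrative, not code
taken from the test:)

# Hypothetical sketch: list the processes alive while the slot drop is
# blocked, using the TAP framework's safe_psql().
my $procs = $node_primary->safe_psql('postgres',
    "SELECT pid, backend_type, state, wait_event"
    . " FROM pg_stat_activity ORDER BY backend_type");
note "processes during the blocking state:\n$procs";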
> It just looks like for reasons I still do not understand, removing a directory
> and 2 files or so takes multiple seconds (at least ~36 new connections, 18
> pg_usleep(100_100)), while there are no other indications of problems.
That fact supports the idea that CheckPointReplicationSlots took a long time.
> I also still don't have a theory why this suddenly started to happen.
Maybe we need to look at the OS-wide disk load at that time. Couldn't the
compiler or other non-postgres tools have put a significant load on the
disks?
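
If we want that visibility in the test log, a rough sketch (assuming
Linux and its /proc/diskstats; untested) could be:

# Rough sketch: snapshot OS-wide disk statistics before and after the
# suspicious section so that disk load shows up in the test log.
# Assumes Linux's /proc/diskstats exists and is readable.
sub diskstats
{
    open my $fh, '<', '/proc/diskstats' or return 'n/a';
    local $/;    # slurp mode: read the whole file at once
    my $stats = <$fh>;
    close $fh;
    return $stats;
}

note "diskstats before:\n" . diskstats();
# ... the part of the test that sometimes stalls ...
note "diskstats after:\n" . diskstats();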
> Unless somebody has another idea, I'm planning to remove all the debugging
> code added, but keep the retry based approach in 019_replslot_limit.pl, so we
> don't again get all the spurious failures.
+1.
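
FWIW, the retry-based approach as I understand it looks roughly like
the sketch below (illustrative only, not the exact in-tree code;
$node_primary and the iteration count are assumptions):

# Poll until only one walsender remains, instead of failing the test
# the first time multiple walsenders are observed.
use Time::HiRes qw(usleep);

my $nwalsenders;
for my $i (1 .. 180)
{
    $nwalsenders = $node_primary->safe_psql('postgres',
        "SELECT count(*) FROM pg_stat_activity"
        . " WHERE backend_type = 'walsender'");
    last if $nwalsenders == 1;
    note "multiple walsenders active in iteration $i";
    usleep(100_000);    # 100ms between retries
}
is($nwalsenders, 1, 'only one walsender remains');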
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center