Re: Race condition in SyncRepGetSyncStandbysPriority

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Date: 2020-04-01 02:01:22
Message-ID: CA+fd4k7rFQ+3fSA-gYho9FbOA3v+E9865jKSg2RDiBOTYMQPYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Fri, 27 Mar 2020 at 13:54, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >
> >
> >
> > On 2020/03/27 10:26, Tom Lane wrote:
> > > Twice in the past month [1][2], buildfarm member hoverfly has managed
> > > to reach the "unreachable" Assert(false) at the end of
> > > SyncRepGetSyncStandbysPriority.
> >
> > When I search the past discussions, I found that Noah Misch reported
> > the same issue.
> > https://www.postgresql.org/message-id/20200206074552.GB3326097@rfd.leadboat.com
> >
> > > What seems likely to me, after quickly eyeballing the code, is that
> > > hoverfly is hitting the blatantly-obvious race condition in that function.
> > > Namely, that the second loop supposes that the state of the walsender
> > > array hasn't changed since the first loop.
> > >
> > > The minimum fix for this, I suppose, would have the first loop capture
> > > the sync_standby_priority value for each walsender along with what it's
> > > already capturing. But I wonder if the whole function shouldn't be
> > > rewritten from scratch, because it seems like the algorithm is both
> > > expensively brute-force and unintelligible, which is a sad combination.
> > > It's likely that the number of walsenders would never be high enough
> > > that efficiency could matter, but then couldn't we use an algorithm
> > > that is less complicated and more obviously correct?
> >
> > +1 to rewrite the function with better algorithm.
> >
> > > (Because the
> > > alternative conclusion, if you reject the theory that a race is happening,
> > > is that the algorithm is just flat out buggy; something that's not too
> > > easy to disprove either.)
> > >
> > > Another fairly dubious thing here is that whether or not *am_sync
> > > gets set depends not only on whether MyWalSnd is claiming to be
> > > synchronous but on how many lower-numbered walsenders are too.
> > > Is that really the right thing?
> > >
> > > But worse than any of that is that the return value seems to be
> > > a list of walsender array indexes, meaning that the callers cannot
> > > use it without making even stronger assumptions about the array
> > > contents not having changed since the start of this function.
> > >
> > > It sort of looks like the design is based on the assumption that
> > > the array contents can't change while SyncRepLock is held ... but
> > > if that's the plan then why bother with the per-walsender spinlocks?
> > > In any case this assumption seems to be failing, suggesting either
> > > that there's a caller that's not holding SyncRepLock when it calls
> > > this function, or that somebody is failing to take that lock while
> > > modifying the array.
> >
> > As far as I read the code, that assumption seems still valid. But the problem
> > is that each walsender updates MyWalSnd->sync_standby_priority at each
> > convenient timing, when SIGHUP is signaled. That is, at a certain moment,
> > some walsenders (also their WalSnd entries in shmem) work based on
> > the latest configuration but the others (also their WalSnd entries) work based
> > on the old one.
> >
> > lowest_priority = SyncRepConfig->nmembers;
> > next_highest_priority = lowest_priority + 1;
> >
> > SyncRepGetSyncStandbysPriority() calculates the lowest priority among
> > all running walsenders as the above, by using the configuration info that
> > this walsender is based on. But this calculated lowest priority would be
> > invalid if other walsender is based on different (e.g., old) configuraiton.
> > This can cause the (other) walsender to have lower priority than
> > the calculated lowest priority and the second loop in
> > SyncRepGetSyncStandbysPriority() to unexpectedly end.
>
> I have the same understanding. Since sync_standby_priroity is
> protected by SyncRepLock these values of each walsender are not
> changed through two loops in SyncRepGetSyncStandbysPriority().
> However, as Fujii-san already mentioned the true lowest priority can
> be lower than lowest_priority, nmembers, when only part of walsenders
> reloaded the configuration, which in turn could be the cause of
> leaving entries in the pending list at the end of the function.
>
> > Therefore, the band-aid fix seems to be to set the lowest priority to
> > very large number at the beginning of SyncRepGetSyncStandbysPriority().
>
> I think we can use max_wal_senders.

Sorry, that's not true. We need another number large enough.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message James Coleman 2020-04-01 02:12:29 Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Previous Message Tomas Vondra 2020-04-01 01:59:25 Re: [PATCH] Incremental sort (was: PoC: Partial sort)