Re: Race condition in SyncRepGetSyncStandbysPriority

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Date: 2020-03-27 04:54:25
Message-ID: 230056dc-01fd-039c-dc48-b2e086d84da5@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/03/27 10:26, Tom Lane wrote:
> Twice in the past month [1][2], buildfarm member hoverfly has managed
> to reach the "unreachable" Assert(false) at the end of
> SyncRepGetSyncStandbysPriority.

When I search the past discussions, I found that Noah Misch reported
the same issue.
https://www.postgresql.org/message-id/20200206074552.GB3326097@rfd.leadboat.com

> What seems likely to me, after quickly eyeballing the code, is that
> hoverfly is hitting the blatantly-obvious race condition in that function.
> Namely, that the second loop supposes that the state of the walsender
> array hasn't changed since the first loop.
>
> The minimum fix for this, I suppose, would have the first loop capture
> the sync_standby_priority value for each walsender along with what it's
> already capturing. But I wonder if the whole function shouldn't be
> rewritten from scratch, because it seems like the algorithm is both
> expensively brute-force and unintelligible, which is a sad combination.
> It's likely that the number of walsenders would never be high enough
> that efficiency could matter, but then couldn't we use an algorithm
> that is less complicated and more obviously correct?

+1 to rewrite the function with better algorithm.

> (Because the
> alternative conclusion, if you reject the theory that a race is happening,
> is that the algorithm is just flat out buggy; something that's not too
> easy to disprove either.)
>
> Another fairly dubious thing here is that whether or not *am_sync
> gets set depends not only on whether MyWalSnd is claiming to be
> synchronous but on how many lower-numbered walsenders are too.
> Is that really the right thing?
>
> But worse than any of that is that the return value seems to be
> a list of walsender array indexes, meaning that the callers cannot
> use it without making even stronger assumptions about the array
> contents not having changed since the start of this function.
>
> It sort of looks like the design is based on the assumption that
> the array contents can't change while SyncRepLock is held ... but
> if that's the plan then why bother with the per-walsender spinlocks?
> In any case this assumption seems to be failing, suggesting either
> that there's a caller that's not holding SyncRepLock when it calls
> this function, or that somebody is failing to take that lock while
> modifying the array.

As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at each
convenient timing, when SIGHUP is signaled. That is, at a certain moment,
some walsenders (also their WalSnd entries in shmem) work based on
the latest configuration but the others (also their WalSnd entries) work based
on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as the above, by using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if other walsender is based on different (e.g., old) configuraiton.
This can cause the (other) walsender to have lower priority than
the calculated lowest priority and the second loop in
SyncRepGetSyncStandbysPriority() to unexpectedly end.

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-03-27 05:06:44 Re: backup manifests
Previous Message Justin Pryzby 2020-03-27 04:44:24 Re: error context for vacuum to include block number