Re: Race condition in SyncRepGetSyncStandbysPriority

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: masahiko(dot)sawada(at)2ndquadrant(dot)com, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Date: 2020-04-13 06:31:01
Message-ID: 20200413.153101.688623432137444694.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
> > On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
> > <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >>> Therefore, the band-aid fix seems to be to set the lowest priority to
> >>> very large number at the beginning of SyncRepGetSyncStandbysPriority().
>
> >> I think we can use max_wal_senders.
>
> > Sorry, that's not true. We need another number large enough.
>
> The buildfarm had another three failures of this type today, so that
> motivated me to look at it some more. I don't think this code needs
> a band-aid fix; I think "nuke from orbit" is more nearly the right
> level of response.
>
> The point that I was trying to make originally is that it seems quite
> insane to imagine that a walsender's sync_standby_priority value is
> somehow more stable than the very existence of the process. Yet we
> only require a walsender to lock its own mutex while claiming or
> disowning its WalSnd entry (by setting or clearing the pid field).
> So I think it's nuts to define those fields as being protected by
> the global SyncRepLock.

Right. FWIW, furthermore, even SyncRepConfig->syncrep_method can be
inconsistent among walsenders. I haven't thought that it can be
relied on as always consistent and it is enough that it makes a
consistent result only while the setting and the set of walsenders is
stable.

> Even without considering the possibility that a walsender has just
> started or stopped, we have the problem Fujii-san described that after
> a change in the synchronous_standby_names GUC setting, different
> walsenders will update their values of sync_standby_priority at
> different instants. (BTW, I now notice that Noah had previously
> identified this problem at [1].)
>
> Thus, even while holding SyncRepLock, we do not have a guarantee that
> we'll see a consistent set of sync_standby_priority values. In fact
> we don't even know that the walsnd array entries still belong to the
> processes that last set those values. This is what is breaking
> SyncRepGetSyncStandbysPriority, and what it means is that there's
> really fundamentally no chance of that function producing trustworthy
> results. The "band aid" fixes discussed here might avoid crashing on
> the Assert, but they won't fix the problems that (a) the result is
> possibly wrong and (b) it can become stale immediately even if it's
> right when returned.

Agreed. And I thought that it's not a problem if we had wrong result
temporarily. And the unstability persists for the standby-reply
interval at most (unless the next cause of instability comes).

> Now, there are only two callers of SyncRepGetSyncStandbys:
> SyncRepGetSyncRecPtr and pg_stat_get_wal_senders. The latter is
> mostly cosmetic (which is a good thing, because to add insult to
> injury, it continues to use the list after releasing SyncRepLock;
> not that continuing to hold that lock would make things much safer).
> If I'm reading the code correctly, the former doesn't really care
> exactly which walsenders are sync standbys: all it cares about is
> to collect their WAL position pointers.

Agreed. To find the sync standby with the largest delay.

> What I think we should do about this is, essentially, to get rid of
> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
> whether *it* believes that it is a sync standby, based on its last
> evaluation of the relevant GUCs. This would be a bool that it'd
> compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

In short, it seems to me like moving the problem into another
place. But I think that there might be a smarter way to find "I am
sync".

> even need to have that field anymore? Not sure.) We should also
> redefine that flag, and sync_standby_priority if it survives, as being
> protected by the per-walsender mutex not SyncRepLock. Then, what
> SyncRepGetSyncRecPtr would do is just sweep through the walsender
> array and collect WAL position pointers from the walsenders that
> claim to be sync standbys at the instant that it's inspecting them.
> pg_stat_get_wal_senders could also use those flags instead of the
> list from SyncRepGetSyncStandbys.
>
> It's likely that this definition would have slightly different
> behavior from the current implementation during the period where
> the system is absorbing a change in the set of synchronous
> walsenders. However, since the current implementation is visibly
> wrong during that period anyway, I'm not sure how this could be
> worse. And at least we can be certain that SyncRepGetSyncRecPtr
> will not include WAL positions from already-dead walsenders in
> its calculations, which *is* a hazard in the code as it stands.
>
> I also estimate that this would be noticeably more efficient than
> the current code, since the logic to decide who's a sync standby
> would only run when we're dealing with walsender start/stop or
> SIGHUP, rather than every time SyncRepGetSyncRecPtr runs.
>
> Don't especially want to code this myself, though. Anyone?
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-04-13 06:34:24 Re: Race condition in SyncRepGetSyncStandbysPriority
Previous Message Masahiko Sawada 2020-04-13 06:24:55 Re: Corruption during WAL replay