Re: Race condition in SyncRepGetSyncStandbysPriority

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: masahiko(dot)sawada(at)2ndquadrant(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Date: 2020-04-14 09:35:38
Message-ID: 20200414.183538.344904915964423903.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote in
> On Tue, 14 Apr 2020 at 10:34, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> writes:
> > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> > >> What I think we should do about this is, essentially, to get rid of
> > >> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
> > >> whether *it* believes that it is a sync standby, based on its last
> > >> evaluation of the relevant GUCs. This would be a bool that it'd
> > >> compute and set alongside sync_standby_priority. (Hm, maybe we'd not
> >
> > > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
> > > walsender thinks it is at, among synchronous_standby_names. Then to
> > > decide "I am a sync standby" we need to know how many walsenders with
> > > higher priority are alive now. SyncRepGetSyncStandbyPriority does the
> > > judgment now and suffers from the inconsistency of priority values.
> >
> > Yeah. After looking a bit closer, I think that the current definition
> > of sync_standby_priority (that is, as the result of local evaluation
> > of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing
> > with it. I suggest that what we should do in SyncRepGetSyncRecPtr()
> > is make one sweep across the WalSnd array, collecting PID,
> > sync_standby_priority, *and* the WAL pointers from each valid entry.
> > Then examine that data and decide which WAL value we need, without assuming
> > that the sync_standby_priority values are necessarily totally consistent.
> > But in any case we must examine each entry just once while holding its
> > mutex, not go back to it later expecting it to still be the same.

SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
sync_standby_priority of any walsender can be changed while the
function is scanning welsenders. The issue is we already have
inconsistent walsender information before we enter the function. Thus
how many times we scan on the array doesn't make any difference.

I think we need to do one of the followings.

A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priority is inconsistent.

B) make SyncRepGetSyncStandbysPriority be tolerant of priority
inconsistency.

C) protect walsender priority array from beinig inconsistent.

The (B) is the band aids. To achieve A we need to central controller
of priority config handling. C is:

> Can we have a similar approach of sync_standby_defined for
> sync_standby_priority? That is, checkpionter is responsible for
> changing sync_standby_priority of all walsenders when SIGHUP. That
> way, all walsenders can see a consistent view of
> sync_standby_priority. And when a walsender starts, it sets
> sync_standby_priority by itself. The logic to decide who's a sync
> standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
> having higher priority along with their WAL positions.

Yeah, it works if we do , but the problem of that way is that to
determin priority of walsenders, we need to know what walsenders are
running. That is, when new walsender comes the process needs to aware
of the arrival (or leaving) right away and reassign the priority of
every wal senders again.

If we accept to share variable-length information among processes,
sharing sync_standby_names or parsed SyncRepConfigData among processes
would work.

> >
> > Another thing that I'm finding interesting is that I now see this is
> > not at all new code. It doesn't look like SyncRepGetSyncStandbysPriority
> > has changed much since 2016. So how come we didn't detect this problem
> > long ago? I searched the buildfarm logs for assertion failures in
> > syncrep.c, looking back one year, and here's what I found:
...
> > The line numbers vary in the back branches, but all of these crashes are
> > at that same Assert. So (a) yes, this does happen in the back branches,
> > but (b) some fairly recent change has made it a whole lot more probable.
> > Neither syncrep.c nor 007_sync_rep.pl have changed much in some time,
> > so whatever the change was was indirect. Curious. Is it just timing?
>
> Interesting. It's happening on certain animals, not all. Especially
> tests with HEAD on sidewinder and curculio, which are NetBSD 7 and
> OpenBSD 5.9 respectively, started to fail at a high rate since a
> couple of days ago.

Coundn't this align the timing of config reloading? (I didn't checked
anything yet.)

| commit 421685812290406daea58b78dfab0346eb683bbb
| Author: Noah Misch <noah(at)leadboat(dot)com>
| Date: Sat Apr 11 10:30:00 2020 -0700
|
| When WalSndCaughtUp, sleep only in WalSndWaitForWal().
| Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
| < sentPtr. That is important in logical replication. When the latest
| physical LSN yields no logical replication messages (a common case),
| that keepalive elicits a reply, and processing the reply updates
| pg_stat_replication.replay_lsn. WalSndLoop() lacks that; when
| WalSndLoop() slept, replay_lsn advancement could stall until
| wal_receiver_status_interval elapsed. This sometimes stalled
| src/test/subscription/t/001_rep_changes.pl for up to 10s.
|
| Discussion: https://postgr.es/m/20200406063649.GA3738151@rfd.leadboat.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2020-04-14 09:45:39 Re: Display of buffers for planning time show nothing for second run
Previous Message Julien Rouhaud 2020-04-14 09:34:56 Re: Display of buffers for planning time show nothing for second run