Condition variable live lock

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Condition variable live lock
Date: 2017-12-22 03:46:21
Message-ID: CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

While debugging a buildfarm assertion failure that appeared after commit
18042840, and assuming that the problem is sensitive to timing and
scheduling, I tried hammering the problem workload on a few different
machines. I noticed that my slow 2-core test machine fairly regularly
got into a live lock state for tens to millions of milliseconds at a
time when there were 3+ active processes, in here:

    int
    ConditionVariableBroadcast(ConditionVariable *cv)
    {
        int nwoken = 0;

        /*
         * Let's just do this the dumbest way possible. We could try to dequeue
         * all the sleepers at once to save spinlock cycles, but it's a bit hard
         * to get that right in the face of possible sleep cancelations, and we
         * don't want to loop holding the mutex.
         */
        while (ConditionVariableSignal(cv))
            ++nwoken;

        return nwoken;
    }

The problem is that another backend can be woken up, determine that it
would like to wait for the condition variable again, and then get
itself added to the back of the wait queue *before the above loop has
finished*, so this interprocess ping-pong isn't guaranteed to
terminate. It seems that we'll need something slightly smarter than
the above to avoid that.
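
To make the effect concrete, here is a minimal single-threaded sketch that simulates it. Everything in it (ToyWaitQueue, toy_signal, the rewaits counters, both toy_broadcast_* functions) is a hypothetical stand-in invented for illustration, not PostgreSQL code. A woken waiter is modeled as immediately re-queueing itself a fixed number of times, where a real backend could do so indefinitely. The bounded variant is only one conceivable way to cap the loop, not a proposed patch: in a real concurrent queue a plain count snapshot would also have to cope with sleep cancelations and newly arriving waiters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's real types. */
#define QUEUE_CAP 16

typedef struct ToyWaitQueue
{
    int ids[QUEUE_CAP];     /* ring buffer of sleeping waiter ids */
    int head;               /* index of the oldest sleeper */
    int count;              /* number of sleepers currently queued */
    int rewaits[QUEUE_CAP]; /* per waiter id: times it will go back to sleep */
} ToyWaitQueue;

/* Add a waiter to the back of the queue. */
void
toy_enqueue(ToyWaitQueue *q, int id)
{
    assert(q->count < QUEUE_CAP);
    q->ids[(q->head + q->count) % QUEUE_CAP] = id;
    q->count++;
}

/* Wake the waiter at the head; returns false if nobody is waiting. */
bool
toy_signal(ToyWaitQueue *q)
{
    int id;

    if (q->count == 0)
        return false;

    id = q->ids[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;

    /*
     * Simulated waiter: it wakes up, decides it wants to wait again, and
     * gets back on the end of the queue before the broadcaster's next call.
     */
    if (q->rewaits[id] > 0)
    {
        q->rewaits[id]--;
        toy_enqueue(q, id);
    }
    return true;
}

/* Same shape as the loop quoted above: signal until the queue is empty. */
int
toy_broadcast_naive(ToyWaitQueue *q)
{
    int nwoken = 0;

    while (toy_signal(q))
        ++nwoken;
    return nwoken;
}

/* Assumed alternative: wake only as many waiters as were queued at entry. */
int
toy_broadcast_bounded(ToyWaitQueue *q)
{
    int initial = q->count;
    int nwoken = 0;

    while (nwoken < initial && toy_signal(q))
        ++nwoken;
    return nwoken;
}
```

With three waiters that each go back to sleep three times, toy_broadcast_naive() performs 12 wake-ups before the queue drains, while toy_broadcast_bounded() performs 3 and leaves the re-queued waiters for a later broadcast.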

I don't currently suspect this phenomenon of being responsible for the
problem I'm hunting, even though it occurs on the only machine on which
I've been able to reproduce my real problem. AFAICT the problem
described in this email can deliver an arbitrary number of spurious
wake-ups and waste an arbitrary amount of CPU time, but should cause no
harm that would affect program correctness. So I haven't tried to write
a patch to fix it just yet. I think we should probably back-patch a fix
when we have one though, because it could bite Parallel Index Scan in
REL_10_STABLE.

--
Thomas Munro
http://www.enterprisedb.com
