Re: [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Subject: Re: [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
Date: 2017-10-03 21:40:28
Message-ID: 477434c1-b52a-3595-b54e-13cdf53c0f82@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/10/17 22:01, Thomas Munro wrote:
> On Wed, Oct 4, 2017 at 7:32 AM, Petr Jelinek
> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>> On 03/10/17 19:44, Tom Lane wrote:
>>> I wrote:
>>>>> So that's trouble waiting to happen, for sure. At the very least we
>>>>> need to do a single fetch of WalRcv->latch, not two. I wonder whether
>>>>> even that is sufficient, though: this coding requires an atomic fetch of
>>>>> a pointer, which is something we generally don't assume to be safe.
>>>
>>> BTW, I had supposed that this bug was of long standing, but actually it's
>>> new in v10, dating to 597a87ccc9a6fa8af7f3cf280b1e24e41807d555. Before
>>> that walreceiver start/stop just changed the owner of a long-lived shared
>>> latch, and there was no question of stale pointers.
>>>
>>> I considered reverting that decision, but the reason for it seems to have
>>> been to let libpqwalreceiver.c manipulate MyProc->procLatch rather than
>>> having to know about a custom latch. That's probably a sufficient reason
>>> to justify some squishiness in the wakeup logic. Still, we might want to
>>> revisit it if we find any other problems here.
>> That's correct, and because the other users of that library don't have
>> special latch it seemed feasible to standardize on the procLatch. If we
>> indeed wanted to change the decision about this I think we can simply
>> give latch as a parameter to libpqrcv_connect and store it in the
>> WalReceiverConn struct. The connection does not need to live past the
>> latch (although it currently does, but that's just a matter of
>> reordering the code in WalRcvDie() a little AFAICS).
>
> I wonder if the new ConditionVariable mechanism would be worth
> considering in future cases like this, where the signalling process
> doesn't know the identity of the process to be woken up (or even how
> many waiting processes there are), but instead any interested waiters
> block on a particular ConditionVariable that models a specific
> interesting condition. In the end it's just latches anyway, but it
> may be a better abstraction. On the other hand I'm not sure how waits
> on a ConditionVariable would be multiplexed with IO (a distinct wait
> event, or leaky abstraction where the caller relies on the fact that
> it's built on MyProc->latch, something else?).
>

In this specific case the problem is that what we are waiting for is
indeed the Latch because we want the libpqwalreceiver calls to be
interruptible when waiting for I/O.

Otherwise, yes ConditionVariable is useful for generic signaling, we use
it in slot and origin for "lock" waits (they don't have normal catalog
so normal locking does not work).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-10-03 22:10:46 Re: parallelize queries containing initplans
Previous Message Maksim Milyutin 2017-10-03 21:36:00 Re: [BUG] Cache invalidation for queries that contains const of temporary composite type