Re: [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver
Date: 2017-03-16 12:00:54
Message-ID: 5eff6c57-eaca-c721-339b-3b621d87ce75@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On 15/03/17 17:55, Tom Lane wrote:
> Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
>> On 03/03/2017 11:11 PM, Tom Lane wrote:
>>> Yeah, I was wondering if this is just exposing a pre-existing bug.
>>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
>>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
>>> would've escaped detection.
>
>> (After a long period of fruitless empirical testing I turned to the code)
>> Maybe I'm missing something, but connectDBComplete() handles a return of
>> PGRESS_POLLING_OK as a success while connectDBStart() seems not to. I
>> don't find anywhere in our code other than libpqwalreceiver that
>> actually uses that interface, so it's not surprising if it's now
>> failing. So my bet is it is indeed a long-standing bug.
>
> Meh ... that argument doesn't hold water, because the old code here called
> PQconnectdbParams which is just PQconnectStartParams then
> connectDBComplete. So the problem cannot be in connectDBStart; that's
> common to both paths. It has to be some discrepancy between what
> connectDBComplete does and what the new loop in libpqwalreceiver is doing.
>
> The original loop coding in 1e8a85009 was not very close to the documented
> spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
> still not really the same: connectDBComplete doesn't call PQconnectPoll
> until the socket is known read-ready or write-ready. The walreceiver loop
> does not guarantee that, but would make an additional call after any
> random other wakeup. It's not very clear why bowerbird, and only
> bowerbird, would be seeing such wakeups --- but I'm having a really hard
> time seeing any other explanation for the change in behavior. (I wonder
> whether bowerbird is telling us that WaitLatchOrSocket can sometimes
> return prematurely on Windows.)
>
> I'm also pretty sure that the ResetLatch call is in the wrong place which
> could lead to missed wakeups, though that's the opposite of the immediate
> problem.
>
> I'll try correcting these things and we'll see if it gets any better.
>

Looks like that didn't help either.

I setup my own Windows machine and can reproduce the issue. I played
around a bit and could not really find a fix other than adding
WL_TIMEOUT and short timeout to WaitLatchOrSocket (it does wait a very
long time on the WaitLatchOrSocket otherwise before failing).

So I wonder if this is the same issue that caused us using different
coding for WaitLatchOrSocket in pgstat.c (lines ~3918-3940).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Alvaro Herrera 2017-03-16 15:51:50 pgsql: Fix ancient get_object_address_opf_member bug
Previous Message Stephen Frost 2017-03-16 04:14:00 pgsql: Be more careful about signed vs. unsigned char

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-03-16 12:25:55 Re: Speedup twophase transactions
Previous Message Peter van Hardenberg 2017-03-16 11:59:06 Error handling in transactions