Re: [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [COMMITTERS] pgsql: Use asynchronous connect API in libpqwalreceiver
Date: 2017-03-15 16:55:49
Message-ID: 7295.1489596949@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
> On 03/03/2017 11:11 PM, Tom Lane wrote:
>> Yeah, I was wondering if this is just exposing a pre-existing bug.
>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
>> would've escaped detection.

> (After a long period of fruitless empirical testing I turned to the code)
> Maybe I'm missing something, but connectDBComplete() handles a return of
> PGRES_POLLING_OK as a success while connectDBStart() seems not to. I
> don't find anywhere in our code other than libpqwalreceiver that
> actually uses that interface, so it's not surprising if it's now
> failing. So my bet is it is indeed a long-standing bug.

Meh ... that argument doesn't hold water, because the old code here called
PQconnectdbParams which is just PQconnectStartParams then
connectDBComplete. So the problem cannot be in connectDBStart; that's
common to both paths. It has to be some discrepancy between what
connectDBComplete does and what the new loop in libpqwalreceiver is doing.

The original loop coding in 1e8a85009 was not very close to the documented
spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
still not really the same: connectDBComplete doesn't call PQconnectPoll
until the socket is known read-ready or write-ready. The walreceiver loop
does not guarantee that, but would make an additional call after any
random other wakeup. It's not very clear why bowerbird, and only
bowerbird, would be seeing such wakeups --- but I'm having a really hard
time seeing any other explanation for the change in behavior. (I wonder
whether bowerbird is telling us that WaitLatchOrSocket can sometimes
return prematurely on Windows.)
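
For illustration, here is a minimal, self-contained sketch of the two loop shapes contrasted above. It does not use libpq: MockConn, mock_connect_poll, MockWakeup and connect_loop are hypothetical stand-ins, and the rule that a premature poll fails is a modelling assumption taken from the hypothesis in this message, not a statement about what PQconnectPoll actually does. The starting status follows the libpq documentation, which says to behave as if the last result was PGRES_POLLING_WRITING.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock of libpq's polling status values. */
typedef enum {
    MOCK_POLLING_FAILED,
    MOCK_POLLING_READING,
    MOCK_POLLING_WRITING,
    MOCK_POLLING_OK
} MockPollStatus;

/* Hypothetical connection state: not a PGconn. */
typedef struct {
    int  steps_left;       /* handshake steps still to perform */
    bool socket_ready;     /* socket became ready since the last poll */
    int  premature_polls;  /* polls issued while the socket was not ready */
} MockConn;

/* Reasons the wait primitive can return (stand-in for WaitLatchOrSocket). */
typedef enum { WAKE_LATCH, WAKE_SOCKET } MockWakeup;

/* Stand-in for PQconnectPoll. ASSUMPTION: polling before the socket is
 * ready is treated as a failure, which is only the hypothesis above. */
static MockPollStatus mock_connect_poll(MockConn *conn)
{
    assert(conn->steps_left > 0);
    if (!conn->socket_ready) {
        conn->premature_polls++;
        return MOCK_POLLING_FAILED;
    }
    conn->socket_ready = false;          /* readiness is consumed by the poll */
    if (--conn->steps_left == 0)
        return MOCK_POLLING_OK;
    return (conn->steps_left % 2) ? MOCK_POLLING_READING : MOCK_POLLING_WRITING;
}

/* Drive the handshake over a scripted list of wakeups.
 * poll_only_when_ready = true  : poll only after a socket wakeup.
 * poll_only_when_ready = false : poll after every wakeup, whatever caused it. */
static MockPollStatus connect_loop(MockConn *conn, const MockWakeup *wakeups,
                                   int nwakeups, bool poll_only_when_ready)
{
    MockPollStatus status = MOCK_POLLING_WRITING;  /* documented starting point */

    for (int i = 0; i < nwakeups; i++) {
        if (wakeups[i] == WAKE_SOCKET)
            conn->socket_ready = true;
        else if (poll_only_when_ready)
            continue;                    /* unrelated wakeup: go back to waiting */

        status = mock_connect_poll(conn);
        if (status == MOCK_POLLING_OK || status == MOCK_POLLING_FAILED)
            break;
    }
    return status;
}
```

With a three-step handshake and the wakeup sequence socket, latch, socket, socket, calling connect_loop with poll_only_when_ready set to true returns MOCK_POLLING_OK, while setting it to false returns MOCK_POLLING_FAILED at the latch wakeup. In this model the only difference between the two runs is the extra poll after the unrelated wakeup.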

I'm also pretty sure that the ResetLatch call is in the wrong place, which
could lead to missed wakeups, though that's the opposite of the immediate
problem.
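
The message does not say where the ResetLatch call sat, so the following is only a generic sketch of how a misplaced reset can lose a wakeup. It is a single-threaded model, not PostgreSQL code: MockState, signal_work and iteration_misses_wakeup are hypothetical names, and the race is simulated by injecting the signal right after the check for work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a latch plus the condition it guards. */
typedef struct {
    bool latch_set;     /* stand-in for the process latch */
    bool work_pending;  /* condition the waiting loop looks for */
} MockState;

/* What the signalling side does: publish the work, then set the latch. */
static void signal_work(MockState *s)
{
    s->work_pending = true;
    s->latch_set = true;
}

/* Simulate one loop iteration in which the signal arrives just after the
 * loop has checked for work. Returns true if the loop would then block
 * even though work is pending, i.e. the wakeup was missed. */
static bool iteration_misses_wakeup(bool reset_before_check)
{
    MockState s = { false, false };

    if (reset_before_check)
        s.latch_set = false;            /* reset first, then look for work */

    bool saw_work = s.work_pending;     /* check for work */

    signal_work(&s);                    /* simulated race: event lands here */
    assert(s.latch_set);

    if (!reset_before_check)
        s.latch_set = false;            /* misplaced reset clears the new signal */

    bool would_block = !s.latch_set;    /* a wait blocks only if the latch is clear */
    return would_block && s.work_pending && !saw_work;
}
```

In this model iteration_misses_wakeup(true) returns false, because the latch is still set when the wait begins, while iteration_misses_wakeup(false) returns true, because the reset after the check discards the signal.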

I'll try correcting these things and we'll see if it gets any better.

regards, tom lane
