Re: BUG #5837: PQstatus() fails to report lost connection

From: "Murray S(dot) Kucherawy" <msk(at)cloudmark(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5837: PQstatus() fails to report lost connection
Date: 2011-01-24 05:59:42
Message-ID: F5833273385BB34F99288B3648C4F06F1341E73FC6@EXCH-C2.corp.cloudmark.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas(at)gmail(dot)com]
> Sent: Sunday, January 23, 2011 8:59 PM
> To: Murray S. Kucherawy
> Cc: Tom Lane; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>
> > As for the reply above, I disagree.  PQstatus(), as documented,
> > doesn't say anything about certain conditions in which it won't report
> > that the connection is dead, when it actually is, once the connection
> > was already established and working.
>
> I understand that gut feeling, but I think if you think about it a
> little, you may realize that at some level it's an unreasonable
> expectation. For example, suppose you were to connect to the server
> (successfully), unplug the network cable, and then call PQstatus().
> There is literally no way for PQstatus() to know whether that
> connection is still there. It's just returning some internal flag out
> of the object.

True, of course. But that's not the case here; PQsendRequest() was used to try to contact the server to submit some SQL operation after the server had been restarted. Enough of that operation failed so as to make PQresultStatus() return PQRES_FATAL_ERROR. There was enough I/O for libpq to figure out that something has gone wrong, in fact enough to know that the cause was administrative termination of the server for some reason (since the error string returned is "FATAL: terminating connection due to administrator command"). Basically it appears, then, that enough communication (or attempts at it) took place to observe that the connection not usable anymore and mark it as such, and even why.

Again, I'm looking at this as a consumer of the API; I don't know, and to some degree shouldn't care, what's going on inside. The results simply look inconsistent.

I don't expect libpq to do any kind of keep-alive work, but once something has reported PGRES_FATAL_ERROR, I would expect the result returned by PQstatus() to be accurate. In this case, it isn't. It gives the impression that an incomplete state change occurred or something like that.

> Now, in this case, things are a bit less clear, because it's not as if
> the remote side has given us no indication that the connection is
> about to get closed. This doesn't strike me as awesome protocol
> design, but 14 years on we're probably stuck with it. I think if you
> want to have some special handling when an administrative shutdown
> happens on the server, you should use PQresultErrorField() to check
> PG_DIAG_SQLSTATE; or if you want FATAL and PANIC conditions to be
> handled specially you can check PG_DIAG_SEVERITY.

The model in OpenDBX appears to be to query and then iterate getting results until there are none; if there's ever an error, see if it's a kind of error that requires a reconnection attempt before continuing. Without assuming PGRES_FATAL_ERROR always means a reconnect is required, it seems like PQstatus() is the right test for the latter of those two "ifs", and that's what they're doing. It doesn't really matter if the disconnect was a result of administrative action or not; the caller just needs to know if it has to try to reconnect before continuing.

> The description of PGRES_FATAL_ERROR in the documentation does pretty
> much suck. I believe you'll get that if the backend returns either an
> ERROR (in which case you need to retry the transaction) or a FATAL (in
> which case you need to reset the connection), but that's not at all
> clear either from the documentation or the naming of the constant
> (which, alas, is hard to change at this point for
> backward-compatibility reasons).

Well my other suggestion would be to assume PGRES_FATAL_ERROR always means the connection needs to be reset. But this blows that idea away; this would cause a connection reset that wouldn't actually solve the problem when it's an ERROR as you described.

> The description of PQstatus() looks correct to me. It says that "a
> communications failure might result in the status changing to
> CONNECTION_BAD prematurely". In the scenario you describe, no
> communications failure has occurred. The server sent back an error
> message, and closed the connection (though libpq hasn't noticed yet)
> but there's no communications error anywhere until the client tries to
> send a query over a connection that doesn't exist any more. Maybe we
> could flesh that description out a bit to make it more clear that this
> is really only going to tell you if TCP/IP has explicitly blown up in
> your face, and not any other reason why things might not be working,
> although I think there are hints of that there already.

Given what you say here, this seems to be an exception for which there is currently no detection mechanism short of looking for that one specific error string indicating it was administrative action causing the error. I think your suggested changes would certainly help, as well as some generalized way to know that the connection handle is now unusable, either because of administrative server action or because the connection became unreliable for some other reason. I'd be fine with having to do PQstatus() || PQadminStatus() too, if such a thing were to appear. Better documentation of PGRES_FATAL_ERROR would help as well.

-MSK

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Darshana 2011-01-24 06:46:19 BUG #5844: maverick
Previous Message Xiaobo Gu 2011-01-24 05:28:39 Re: [HACKERS] Is there a way to build PostgreSQL client libraries with MinGW