libpq bug

From: "Kirby Bohling (TRSi)" <kbohling(at)oasis(dot)novia(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Subject: libpq bug
Date: 2000-09-15 15:03:18
Message-ID: Pine.GSO.4.21.0009151001590.14400-100000@oasis.novia.net
Lists: pgsql-bugs

Your name : Kirby C. Bohling
Your email address : kbohling(at)oasis(dot)novia(dot)net


System Configuration
---------------------
Architecture (example: Intel Pentium) : Intel PII 550

Operating System (example: Linux 2.0.26 ELF) : FreeBSD 4.0 Release

PostgreSQL version (example: PostgreSQL-7.0): PostgreSQL-7.0.2

Compiler used (example: gcc 2.8.0) : gcc 2.95.2 19991024 (release)


Please enter a FULL description of your problem:
------------------------------------------------
I have a C++ application that runs for extended periods of time and
keeps the same postgres connection open forever. After running for some
period of time, the code will hang. After attaching with gdb, it is always
hung in the same spot: fe-misc.c:739, which is a call to select. I
haven't compiled with debugging information, so I can't tell what it is
waiting on. After reviewing the logs, I see a SIGPIPE and
"PQsendQuery -- There is no connection to the back end". I believe that
the backend has died, and this is the symptom of that.

The one thing I noticed is that the code only hangs when I try to start
a transaction. After close examination, I realized that the only
difference is that I didn't call PQstatus() before calling PQexec(). I
have investigated the code in libpq.

This is my guess at the stack trace. I don't have the code compiled with
debugging, and I haven't got the time to do that and wait around for the
bug to happen again.

#0 0xXXXXX in pqWait at fe-misc.c:739
#1 0xXXXXX in PQgetResult at fe-exec.c:1126
#2 0xXXXXX in PQexec at fe-exec.c:1204
#3 0xXXXXX in myFuncThatCallsPQexec() myFuncs.c: 1234

If you follow the code from the entry into PQexec all the way into
pqWait, and then down into the select call, nowhere in the path of
execution does it check conn->status to see if the status is
CONNECTION_OK; it only checks that the socket is non-negative. This
was by visual inspection, not using a debugger, so double check that.

If my guess is correct, the backend has gone away, and select can't tell
that you are never going to be able to read or write on that socket. It
might break out of the deadlock if the select call passed the file
descriptor in the exception fd list (NOTE: Not all select()'s are the
same. I ran across serious problems with code that depended on the way AIX
handled exception fd's versus the way Solaris 2.6 did; that discussion is
way, way beyond the scope of this email). My guess is that the connection
status is CONNECTION_BAD, but I can't tell. The debugger won't help me
out, because at libpq-fe.h:86

typedef struct pg_conn PGConn;

Nice opaque typedef, but no way for me to print the structure in a
debugger short of printing the raw memory.
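
To make the point about select() concrete, here is a minimal sketch of a
bounded wait. It is not libpq code: wait_readable() is a hypothetical
helper, and the timeout is an assumption added for illustration (the
report above only suggests the exception fd list). It puts the descriptor
in both the read set and the exception set, so the caller gets control
back even if nothing ever arrives.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <unistd.h>   /* older systems declare select() here */

/*
 * Hypothetical helper, not part of libpq.
 * Waits until fd is readable, has an exceptional condition pending,
 * or timeout_ms milliseconds have passed.
 * Returns 1 if the descriptor is ready, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    fd_set exceptfds;
    struct timeval tv;
    int rc;

    assert(timeout_ms >= 0);

    /* select() cannot handle negative or too-large descriptors */
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    do {
        /* The sets and the timeout may be modified by select(),
         * so rebuild them on every attempt. A retry after EINTR
         * therefore restarts the full timeout. */
        FD_ZERO(&readfds);
        FD_ZERO(&exceptfds);
        FD_SET(fd, &readfds);
        FD_SET(fd, &exceptfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        rc = select(fd + 1, &readfds, NULL, &exceptfds, &tv);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;          /* real error, see errno */
    if (rc == 0)
        return 0;           /* timed out, nothing to read */
    return 1;               /* readable or exceptional condition */
}
```

A return value of 0 gives the caller a chance to look at the connection
state (for example with PQstatus()) instead of blocking forever. Whether
the exception set reports anything useful for a dead peer varies between
platforms, as noted above.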

I believe that I have written a work-around for my project: I wrote a
wrapper around PQexec that always calls PQstatus first and fakes the
error codes if PQstatus reports a bad connection, and my problems seem
to magically disappear. The program resets its connection if the
connection goes south, and life is great.
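
The work-around described above might look roughly like the sketch below.
This is not the actual wrapper from my project. All names here
(guarded_exec, conn_ops, fake_conn and so on) are hypothetical, and the
connection is accessed through callbacks so the sketch compiles without
libpq. In a real program, is_ok would presumably wrap PQstatus() and exec
would wrap PQexec().

```c
#include <stddef.h>

/* Outcome of a guarded call (hypothetical names). */
typedef enum {
    EXEC_OK,
    EXEC_FAILED,
    EXEC_NO_CONNECTION   /* faked error: the query was never sent */
} exec_result;

/* The two operations the wrapper needs from a connection. */
typedef struct {
    int (*is_ok)(void *conn);                  /* nonzero if usable */
    int (*exec)(void *conn, const char *sql);  /* 0 on success */
} conn_ops;

/*
 * Check the connection status before every query. If the connection
 * is bad, return a faked error immediately instead of calling exec,
 * which could block waiting on a backend that is gone.
 */
exec_result guarded_exec(const conn_ops *ops, void *conn, const char *sql)
{
    if (!ops->is_ok(conn))
        return EXEC_NO_CONNECTION;

    return ops->exec(conn, sql) == 0 ? EXEC_OK : EXEC_FAILED;
}

/* Fake connection, used only to exercise the wrapper. */
typedef struct {
    int ok;          /* simulated connection status */
    int exec_calls;  /* how many times exec was really invoked */
} fake_conn;

static int fake_is_ok(void *conn)
{
    return ((fake_conn *) conn)->ok;
}

static int fake_exec(void *conn, const char *sql)
{
    (void) sql;                           /* query text is ignored */
    ((fake_conn *) conn)->exec_calls++;
    return 0;                             /* always "succeeds" */
}

static const conn_ops fake_ops = { fake_is_ok, fake_exec };
```

On EXEC_NO_CONNECTION the caller can reset the connection and retry, which
is what my program does. Keeping the reset outside the wrapper is a design
choice of this sketch; it could equally be done inside.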

My guess is that somewhere along the way, PQstatus() should be called, or
conn->status should be checked. I am not sure of the most appropriate
place to put the fix. There might also be some very good reason that it
isn't there.

All that being said, I believe that the bug is my fault, for failing to
check the connection status before calling PQexec(). But removing
an infinite wait condition seems pretty valuable to me,
even if the infinite wait is due to a lazy programmer. Hence I took the
time to figure this much out.

Please describe a way to repeat the problem. Please try to
provide a concise reproducible example, if at all possible:
----------------------------------------------------------------------




If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------
My best guess is the following:

Add the following lines to pqWait()

/* I am not sure if any other cases should be or'ed with this, but I know
   that looking for != CONNECTION_OK is a bad idea, as while initiating a
   connection, the state is not CONNECTION_OK, but pqWait is called */
if (conn->status == CONNECTION_BAD)
{
    printfPQExpBuffer(&conn->errorMessage,
                      "pqWait() -- bailing, connection is bad\n");
    return EOF;
}

Thanks if you read this far, I would have given up by now...

Kirby
