Re: Deadlock in libpq

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Erik Hesselink <hesselink(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Deadlock in libpq
Date: 2011-03-24 14:21:19
Message-ID: AANLkTikSR0g9KMDYTAY7j_5s5SQpnc0W=CS1wxw4jq1G@mail.gmail.com
Lists: pgsql-general

On Thu, Mar 24, 2011 at 9:07 AM, Erik Hesselink <hesselink(at)gmail(dot)com> wrote:
> On Thu, Mar 24, 2011 at 14:23, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink(at)gmail(dot)com> wrote:
>>> Hi,
>>>
>>> We're getting a deadlock in our application (a web application with a
>>> PostgreSQL backend) which I've traced to libpq. I've started our
>>> application in gdb, and when it hangs, I've inspected the backtraces.
>>> I've found a couple of threads I can account for (listening for new
>>> connections, background processes) and 77 threads waiting for a mutex
>>> lock:
>>>
>>> #0  0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
>>> #1  0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
>>> #2  0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
>>> #3  0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>>>      ==> pg_lockingcallback
>>> #4  0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
>>> #5  0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
>>> #6  0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
>>> #7  0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>>>      ==> pqsecure_open_client
>>> #8  0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
>>> #9  0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>>>      ==> connectDBComplete
>>> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
>>> #11 0x0000000000f9b518 in sccR_info ()
>>> #12 0x0000000000000000 in ?? ()
>>>
>>> So it seems everything is waiting for a lock on a mutex from
>>> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
>>> can happen? Is this something we're doing wrong (I hope so) or a bug
>>> in libpq?
>>>
>>> Some background: this happens only after a couple of thousand requests
>>> (each doing about 15 database calls), with occasional other requests
>>> coming in at the same time. Our server uses a Haskell binding to libpq
>>> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
>>> same machine, running 64bit Ubuntu 10.04. The database version is
>>> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
>>> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
>>> the libpq version, but it is the most recent that comes with this
>>> ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
>>> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
>>> some CPU load when this happens. There is plenty of free memory.
>>>
>>> When I turned off SSL or connected via domain sockets, we got different
>>> errors that are possibly related: occasionally, the connection between
>>> client (our app) and server (database) is lost. On the client, we get:
>>>
>>>    connectPostgreSQL: server closed the connection unexpectedly
>>>    This probably means the server terminated abnormally
>>>    before or while processing the request.
>>>
>>> and on the server:
>>>
>>>    could not send data to client: Broken pipe
>>>
>>> There is no further context around these messages.
>>>
>>> Any help would be greatly appreciated.
>>
>> How did you initialize ssl?   You are waiting inside a lock that is
>> getting set up inside the crypto library.  Unless you are having some
>> type of library initialization issue, I doubt the problem is
>> really inside libpq.  Is your application multithreaded, and if so are
>> you properly synchronizing access to the connection object, etc?
>
> What do you mean exactly with "How did you initialize ssl"? I found
> [1], which I did not know about. This seems to be a very non-local
> problem: if one of our dependencies initializes ssl, and I use libpq
> as well, this will go wrong. I've done a quick look through all our
> dependencies, and none seem to use libcrypto or libssl.

*something* must be initializing ssl, or you can't make secure
connections from libpq. You need to find out which pq ssl init
function is being called, when it is being called, and with what
arguments. One of the main things PQinitSSL does is set up a lock
vector which it passes to the crypto library. The fact that you are
having blocking issues around those locks suggests one of the
following: SSL was not set up properly, something happened after setup
so that the locks are no longer good, you have an application
threading issue (although that sounds unlikely), or (least likely,
worst case) there is a bug in crypto.
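
To make the "lock vector" idea concrete, here is a minimal sketch,
assuming plain pthreads, of the general pattern: an array of mutexes is
allocated once, and a callback locks or unlocks entry n on request. All
demo_* names and DEMO_* flags are hypothetical and exist only for this
illustration; this is not the actual libpq or OpenSSL code, which
differs in detail.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the mode flags a crypto library would pass. */
#define DEMO_LOCK       1
#define DEMO_UNLOCK     2
#define DEMO_NUM_LOCKS  4       /* a real library reports how many it needs */
#define DEMO_ITERATIONS 10000

static pthread_mutex_t *demo_lockarray = NULL;  /* the "lock vector" */
static long demo_counter = 0;                   /* shared state guarded by lock 0 */

/* Allocate and initialize the lock vector once, before any thread uses it.
 * Returns 0 on success, -1 on failure. */
static int demo_init_locks(void)
{
    if (demo_lockarray != NULL)
        return 0;                               /* already initialized */

    pthread_mutex_t *locks = malloc(sizeof(*locks) * DEMO_NUM_LOCKS);
    if (locks == NULL)
        return -1;

    for (int i = 0; i < DEMO_NUM_LOCKS; i++) {
        if (pthread_mutex_init(&locks[i], NULL) != 0) {
            while (i-- > 0)                     /* undo the ones that succeeded */
                pthread_mutex_destroy(&locks[i]);
            free(locks);
            return -1;
        }
    }
    demo_lockarray = locks;
    return 0;
}

/* The callback the library would invoke: lock or unlock entry n. */
static void demo_locking_callback(int mode, int n)
{
    assert(demo_lockarray != NULL && n >= 0 && n < DEMO_NUM_LOCKS);
    if (mode & DEMO_LOCK)
        pthread_mutex_lock(&demo_lockarray[n]);
    else
        pthread_mutex_unlock(&demo_lockarray[n]);
}

/* Each worker updates the shared counter only while holding lock 0. */
static void *demo_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < DEMO_ITERATIONS; i++) {
        demo_locking_callback(DEMO_LOCK, 0);
        demo_counter++;
        demo_locking_callback(DEMO_UNLOCK, 0);
    }
    return NULL;
}

/* Start nthreads workers and wait for them.
 * Returns the final counter value, or -1 if setup failed. */
long demo_run(int nthreads)
{
    if (nthreads <= 0 || demo_init_locks() != 0)
        return -1;

    pthread_t *threads = malloc(sizeof(*threads) * (size_t) nthreads);
    if (threads == NULL)
        return -1;

    demo_counter = 0;
    int started = 0;
    for (; started < nthreads; started++) {
        if (pthread_create(&threads[started], NULL, demo_worker, NULL) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    return started == nthreads ? demo_counter : -1;
}
```

In this sketch every lock call is paired with an unlock on the same
entry of the same array. If the array were freed or replaced after
setup while a thread still held an entry, or an unlock were skipped,
later callers could block in pthread_mutex_lock indefinitely, which is
the kind of hang the backtrace shows.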

merlin
