PQConsumeinput stuck on recv

From: Andre Oliveira Freitas <afreitas(at)callixbrasil(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: PQConsumeinput stuck on recv
Date: 2018-02-23 16:33:18
Message-ID: CAN6ijTDiFnzeWyDkbcL9dYj6-G9nQAsW1gL6OK+bQMrUBgn8iQ@mail.gmail.com
Lists: pgsql-general

Hi, I've been experiencing an issue. We use an open-source VoIP software
whose backend is PostgreSQL. Initially we had a two-server setup (one
server running the VoIP software, another one running the pg instance). Due
to company growth we were running into performance issues, so we rolled out
a new architecture using multiple VoIP servers connected to the single pg
instance. Since then, the VoIP software has started misbehaving: it randomly
stops responding, and only a restart gets it back up and running. The
freezes are random across servers, time of day, day of week... we haven't
found a correlation between them and any other metric like CPU usage,
network traffic and such.

Since it's been happening for a few weeks now, every time it freezes we
take a gcore dump and check it in gdb... and after a lot of hair pulling
and learning about the innards of the VoIP software we see that most often
the software is stuck in this call trace:

#0 in __libc_recv (fd=409, buf=0x7f2c4802e6c0, n=16384, flags=1898970523)
at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#2 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#3 in PQconsumeInput () from /usr/lib/x86_64-linux-gnu/libpq.so.5

The software shares a database connection between threads, and controls
access to it through a mutex, so once the thread holding the mutex gets
stuck in the location above, all other threads start piling up behind the
mutex, and that's apparently the reason the software stops responding for
most of its functions (while other functions that do not depend on the
database work normally).

And it stays stuck there forever... on one occasion we took two gcore
dumps spaced 10 minutes apart, and they look almost identical, with the
same thread stuck on recv and all the others waiting for the lock to be
released.

I wonder if anyone has any tip on what to look for next... Besides the
implementation of the VoIP software itself, we are looking into network
issues (we are seeing a bunch of TCP retransmissions between some servers
and the db), but otherwise no other app running on those servers has
presented any weird behavior like this VoIP software. We don't understand
what would cause recv to get stuck like this.
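
For illustration, one guard that could sit in front of the read is a bounded
wait on the socket, so the thread holding the mutex gives up after a timeout
instead of sitting in recv indefinitely. This is only a sketch under
assumptions: wait_readable and the two demo functions are hypothetical names,
not part of libpq or of the VoIP software, and a socketpair stands in for the
database connection. In a real libpq client the descriptor would come from
PQsocket(conn), and PQconsumeInput(conn) would be called only after the wait
reports the socket as ready.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is ready for reading or timeout_ms
 * elapses. Returns 1 if poll() reports the fd ready (data, hangup or an
 * error condition that a following read would surface), 0 on timeout,
 * -1 if poll() itself fails. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(timeout_ms >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait. Note that each retry
     * restarts the full timeout; good enough for a sketch. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Demo: a peer that never sends anything, standing in for a database
 * connection that has gone silent. Expected to time out. */
int demo_silent_peer(void)
{
    int sv[2];
    int result;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    result = wait_readable(sv[0], 200);
    close(sv[0]);
    close(sv[1]);
    return result;
}

/* Demo: a peer that has already written one byte. Expected to be ready. */
int demo_peer_with_data(void)
{
    int sv[2];
    int result;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[1], "x", 1) != 1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    result = wait_readable(sv[0], 200);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

With a guard like this, a timeout (return value 0) would let the caller
release the mutex and treat the connection as suspect, rather than blocking
every other thread behind it.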

BTW we're running Debian 9, pg 9.6.3, and the VoIP software (along with most
of the other apps) uses a slightly older version of libpq (9.4.15).

Thanks in advance.

Andre
