libpq: indefinite block on poll during network problems

From: Dmitry Samonenko <shreddingwork(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: libpq: indefinite block on poll during network problems
Date: 2014-05-27 07:51:39
Message-ID: CAFKp+3chP+RWzsznpky_d7-QQjGQo84Q6U4RGOgRjGZjPvPm3w@mail.gmail.com
Lists: pgsql-general

I have an application that uses libpq to talk to a remote PostgreSQL 9.2.4
server. The client and server nodes run Linux, and the connection is
established over TCP/IPv4. The client application has some small
fault-tolerance features, which are activated when server-related problems
are encountered.

One day the network hardware misbehaved and, long story short, the host
running the PostgreSQL server got isolated. TCP segments routed to the
server node were NOT delivered or acknowledged in any way. According to the
debugger, the client application was blocked inside libpq code.

I have successfully reproduced the problem in a laboratory environment.
Run these iptables commands on the server node after some period of
client <-> server interaction:

# iptables -A OUTPUT -p tcp --sport 5432 -j DROP
# iptables -A INPUT -p tcp --dport 5432 -j DROP

I glanced over the master branch of the libpq sources, and some questions
arose. Namely:

1. The connection to the PostgreSQL server is made without an option to
specify SO_RCVTIMEO and SO_SNDTIMEO. Why is that? Is setting socket
timeouts considered harmful?

2. PQexec ultimately leads to pqWait, which after some function calls
"lands" in pqSocketCheck and pqSocketPoll. These two functions take an
end_time parameter. It is set to -1 in the PQexec scenario, which leads to
an infinite poll timeout in pqSocketPoll. Is it possible to implement a
configurable timeout for PQexec calls? Are there existing features that
should be used to handle a situation like this?

Currently, I have changed the Linux kernel's IPv4 TCP stack settings
responsible for retransmission, so the OS actually closes the socket after
some period. This is detected by the poll call in pqSocketPoll and libpq
handles the situation correctly: an error is reported to my application.
But it's just a workaround.
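
A per-socket variant of this workaround might be the Linux-specific
TCP_USER_TIMEOUT socket option (available since kernel 2.6.37), which limits
how long transmitted data may stay unacknowledged before the kernel drops
the connection. The sketch below is an assumption on my side, not something
libpq offers here: set_tcp_user_timeout is a hypothetical helper, and
applying it to the descriptor returned by PQsocket(conn) has not been
verified against the scenario above.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Ask the kernel to abort the connection when transmitted data stays
 * unacknowledged for longer than timeout_ms (Linux only).
 * Returns 0 on success, -1 on failure with errno set by setsockopt.
 */
int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```

It would be called once after the connection is established, for example
set_tcp_user_timeout(PQsocket(conn), 30000) for a 30 second limit. Unlike a
system-wide retransmission setting, this affects only the one socket.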

So this infinite poll looks like an imperfection to me, and I think it
should be considered a bug. Is it?

With regards,
Dmitry Samonenko
