Re: Fwd: libpq: indefinite block on poll during network problems

From: Dmitry Samonenko <shreddingwork(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, amit(dot)kapila16(at)gmail(dot)com
Subject: Re: Fwd: libpq: indefinite block on poll during network problems
Date: 2014-05-29 08:27:50
Message-ID: CAFKp+3cbU3s-V-HEUvg-n+Qx4G4kCD6=n8jxuvK1ORV6K_uayQ@mail.gmail.com
Lists: pgsql-general

Guys, first of all: thank you for your help and cooperation. I have received
several mails suggesting tweaks for tcp_keepalive and usage of PostgreSQL
server functions/features (cancel, statement timeout), but as I said, they
won't help.

I have reproduced the problem scenario. Logs are attached. I will walk you
through them.

== Setup ==
Client and server applications are placed on separate hosts. Client =
192.168.15.4, Server = 192.168.15.7. Both are on the same local network, and
both are synchronized using a 3rd party NTP server. Let's look at
strace_export.txt: the top 8 lines are the socket setup. The keepalive option
is set. The client's OS keepalive parameters:

[root@krr2srv1wsn1 dtp_generator]# sysctl -a | grep keepalive
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 10

This means that after 10 seconds of idle connection the first TCP Keep-Alive
probe is sent. If 3 probes at 5-second intervals fail, the connection should
be considered dead.
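
To make the timing concrete, here is a minimal Python sketch of that
arithmetic (it is not part of the attached logs). Both function names are
hypothetical. The formula assumes the usual Linux behaviour: the first probe
goes out after tcp_keepalive_time of idleness, then one probe per
tcp_keepalive_intvl, and the connection is dropped once tcp_keepalive_probes
of them go unanswered. The second helper shows the per-socket equivalents of
the sysctl values, which I assume are only available where the platform
defines them.

```python
import socket


def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Approximate time from the last traffic until an idle dead peer is detected."""
    # First probe after idle_s; each unanswered probe is waited on for interval_s.
    return idle_s + interval_s * probes


def enable_keepalive(sock, idle_s, interval_s, probes):
    """Turn on keepalive for one socket and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are only defined on platforms that support them.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


if __name__ == "__main__":
    # Values taken from the sysctl output above.
    print(keepalive_detection_seconds(10, 5, 3))
```

With the values above this gives 25 seconds, but only for a connection that
is actually idle.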

Server configuration is in postgresql.conf.

== Part 1. TCP Keep Alive ==
At 11:25:35.847138 the connection to the server is made and the first query
is sent. The response arrives quickly, at 11:25:35.858582. No other queries
were made for the next minute, in order to capture keep-alive packets.
Wireshark 1.8.2 marks frames 13 - 36 as Keep-Alive, so we can see that it is
configured correctly and definitely works.

== Part 2. The Problem ==
At 11:26:40.933017 query generation is started on the client side. The client
is configured to perform 1 request per second. After some arbitrary time the
following command is executed on the server node:
[root@cluster1]# date && iptables -A OUTPUT -p tcp --sport 5432 -j DROP &&
iptables -A INPUT -p tcp --dport 5432 -j DROP

11:26:47 is printed to the console. As you can see in the client trace file,
this time corresponds to frame 55, where the last query is made. strace shows
the send and poll syscalls. And... that's it. The client is blocked on poll.

== Part 3. The aftermath ==
The client was blocked for ~2 minutes. I killed the application with SIGTERM,
which you can see in strace. At that time the application was still waiting
on libpq's poll. The pcap file shows no trace of keep-alive packets after the
server was isolated with the iptables rules. As I said earlier: TCP
Keep-Alive is done on idle connections only. When TCP retransmission kicks
in, TCP Keep-Alive is not performed.

Let me repeat myself: the problem is NOT with the server. The problem is with
libpq's PQgetResult, which ultimately leads to a very optimistic poll
routine.
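
For illustration only, here is a minimal Python sketch of what a less
optimistic wait could look like: a poll with a deadline instead of an
infinite one. This is not libpq code and not a proposed patch. wait_readable
is a hypothetical helper, and the socketpair merely stands in for a server
that has gone silent.

```python
import select
import socket


def wait_readable(sock, timeout_s):
    """Return True if sock becomes readable within timeout_s seconds, else False."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() takes milliseconds; an empty result means the deadline expired.
    events = poller.poll(int(timeout_s * 1000))
    return bool(events)


if __name__ == "__main__":
    client, server = socket.socketpair()
    # The peer sends nothing, so the caller gets control back after the
    # timeout rather than blocking forever.
    print(wait_readable(client, 0.2))
    # Once the peer answers, the same call reports the socket as readable.
    server.send(b"x")
    print(wait_readable(client, 0.2))
    client.close()
    server.close()
```

With a bound like this the application can decide for itself when to give up
on a reply and close the connection.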

Thank you.

With regards, Dmitry Samonenko.

Attachment         Content-Type                  Size
strace_export.txt  text/plain                    4.7 KB
client.pcap        application/x-extension-pcap  7.4 KB
server.pcap        application/x-extension-pcap  7.3 KB
postgresql.conf    application/octet-stream      1.6 KB
