Re: psycopg2 (async) socket timeout

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Danny Milosavljevic <danny(dot)milo(at)gmail(dot)com>, psycopg(at)postgresql(dot)org
Subject: Re: psycopg2 (async) socket timeout
Date: 2011-02-15 22:13:17
Message-ID: 4D5AFA7D.4080202@wulczer.org
Lists: psycopg

On 15/02/11 21:55, Marko Kreen wrote:
> On Tue, Feb 15, 2011 at 3:32 PM, Jan Urbański <wulczer(at)wulczer(dot)org> wrote:
>> * the app sends a keepalive, receives response
>
> Sort of true, except Postgres does not have app-level
> keepalive (except SELECT 1). The PQping mentioned
> earlier creates new connection.

By this I meant that an app is connected using libpq with keepalives
enabled.

>> * the connection is idle
>> * before the next keepalive is sent, you want to do a query
>> * the connection breaks silently
>> * you try sending the query
>> * libpq tries to write the query to the connection socket, does not
>> receive TCP confirmation
>
> The TCP keepalive should help for those cases, perhaps
> you are doing something wrong if you are not seeing the effect.

Well, for me it doesn't help. I'm not sure if it's my fault, the
kernel's, or just how TCP ought to work.

>> * the kernel starts retransmitting the data, using TCP's RTO algorithm
>> * you don't get notified about the failure until the TCP gives up, which
>> might be a long time
>
> I'm not familiar with RTO, so cannot comment.
>
> Why would it stop keepalive from working?

Looking at the traffic in Wireshark I'm seeing TCP retransmissions and
no keepalive traffic.

> The need for periodic query is exactly the thing that keepalive
> should fix. OTOH, if you have connections that are long time idle
> you could simply drop them.
>
> We have the (4m idle + 4x15sec ping) parameters as
> default and they work fine - dead connection is killed
> after 5m.

Hm, so my test is like this:

* I connect with psycopg2 enabling keepalives in the connection string,
using "keepalives_idle=4 keepalives_interval=1 keepalives_count=1"
* the test program sends a "select pg_sleep(6)" and then sleeps itself
for 6 seconds, and does that in a loop
* each time after the query is sent and 4 seconds elapse I'm seeing TCP
keepalive packets going to the server and the server responding
* each time after the program sleeps for 4 seconds, a keepalive is sent
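
For readers without the attachment, the test described above could look roughly like the sketch below. This is not the attached keepalive.py, only a reconstruction from the description: the helper names and the "dbname=test" base string are assumptions, and keepalives=1 is spelled out even though libpq enables keepalives by default.

```python
import time


def keepalive_dsn(base="dbname=test", idle=4, interval=1, count=1):
    """Build a libpq connection string with client-side TCP keepalives.

    The base string is a placeholder; the keepalive values are the ones
    quoted in the message.
    """
    return (
        f"{base} keepalives=1 keepalives_idle={idle} "
        f"keepalives_interval={interval} keepalives_count={count}"
    )


def run_test_loop(dsn, iterations=3, pause=6):
    """Alternate between a sleeping query and a client-side sleep."""
    # Imported lazily so keepalive_dsn() works without psycopg2 installed.
    import psycopg2

    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    for _ in range(iterations):
        # The server is busy for `pause` seconds, then the client is idle
        # for `pause` seconds; keepalives should show up in both phases.
        cur.execute("select pg_sleep(%s)", (pause,))
        time.sleep(pause)
    # No commit on purpose: the session stays inside a transaction, which
    # matches the "IDLE in transaction" state reported for the backend.
    conn.close()


if __name__ == "__main__":
    print(keepalive_dsn())
```

Calling run_test_loop(keepalive_dsn()) needs a reachable server; the __main__ block only prints the connection string.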

To simulate a connectivity loss I'm adding two rules to my firewall that
block (the iptables DROP target) communication from or to port 5432.
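
The two firewall rules might be built like this. It is only a guess at what the attached break-network command does: it assumes the rules are added on the client machine, and the function name is made up for illustration. The code only constructs the iptables command lines and does not run them.

```python
def block_port_rules(port=5432, action="-A"):
    """Return iptables commands that drop TCP traffic to and from `port`.

    Use action="-A" to append the rules and action="-D" to delete them
    again. Assumes the rules are installed on the client side.
    """
    return [
        # Packets leaving the client towards the server port.
        ["iptables", action, "OUTPUT", "-p", "tcp",
         "--dport", str(port), "-j", "DROP"],
        # Packets arriving at the client from the server port.
        ["iptables", action, "INPUT", "-p", "tcp",
         "--sport", str(port), "-j", "DROP"],
    ]


if __name__ == "__main__":
    for rule in block_port_rules():
        print(" ".join(rule))
```

Running the printed commands requires root; passing each list to subprocess.run would apply them from Python.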

Now there are three scenarios:

1. if I block the connection right after the test program goes to sleep,
the response to the keepalive is not received and a connectivity loss is
detected. The app sends a RST packet (that obviously does not reach the
server) and when it wakes up and tries to send the query, psycopg
complains about a broken connection. Important: the backend stays alive
and PG shows the connection as "IDLE in transaction".

2. if I block the connection after the test program has already sent the
keepalive, but before it has sent the query, it actually goes ahead and
tries to send the query, and then blocks because the kernel is retrying
the TCP delivery of the query. Keepalives are *not* sent and the process
of TCP giving up can take quite some time (depends on the settings for
TCP timeout). The connection stays alive on the server anyway.

3. if I block the connection while it's waiting for the query to
complete, a keepalive is sent, the connection is detected to be broken,
the execute statement fails with a SystemError: null argument to
internal routine (sic!), and the connection stays on the server.

I'm not really sure what's the deal with the SystemError, but my
conclusions are:

* while TCP retry is in action, it disables keepalives
* the backend stays alive on the server side anyway

It's possible that TCP retries take a few minutes and I'm simply not
patient enough (of course I'm not using a keepalive interval of 1 second
in production). So if all you want is to detect a broken connection a
couple of minutes from the moment it happened, you can have client-side
keepalives tuned as Marko does it, and check that your TCP stack gives
up a delivery attempt in less than a few minutes.

On the other hand, you probably *should* also use server-side
keepalives, so the server can detect a broken connection and kill the
backend, otherwise you will end up with lots of "IDLE in transaction"
backends, which is Very Bad (can block autovacuum, still holds
transaction locks etc).
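
One way to set server-side keepalives per session is through the tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings; they can also go into postgresql.conf. The sketch below just generates the SET statements, using the 4 minutes idle plus 4 x 15 seconds values Marko quoted as defaults. The function name is an assumption for illustration.

```python
def server_keepalive_statements(idle=240, interval=15, count=4):
    """Return SET statements enabling server-side TCP keepalives.

    Defaults follow the "4m idle + 4x15sec ping" values from the thread.
    """
    settings = {
        "tcp_keepalives_idle": idle,          # seconds of idleness before probing
        "tcp_keepalives_interval": interval,  # seconds between probes
        "tcp_keepalives_count": count,        # lost probes before giving up
    }
    return [f"SET {name} = {int(value)}" for name, value in settings.items()]


if __name__ == "__main__":
    for statement in server_keepalive_statements():
        print(statement)
```

Each statement could be run through cursor.execute() right after connecting; with these values a dead client should be noticed after about 5 minutes.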

I'm going to do some more tests to see the default timeout for TCP
delivery and if it's really in the range of 5 minutes, I'll be happy.
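
A rough way to estimate that timeout without waiting for it: on Linux the number of retransmissions of unacknowledged data is limited by net.ipv4.tcp_retries2 (default 15), and the retransmission timeout doubles on each attempt between a floor and a ceiling. The sketch below assumes a 200 ms floor and a 120 s ceiling, the figures the kernel documentation uses for its hypothetical estimate. Real connections derive the initial timeout from the measured round-trip time, so actual numbers will differ.

```python
def estimate_tcp_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long TCP keeps retransmitting before giving up.

    Sums one timeout per retransmission plus the final wait after the
    last one, doubling the timeout each time up to rto_max.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    seconds = estimate_tcp_give_up_seconds()
    print(f"{seconds:.1f} s, about {seconds / 60:.1f} min")
```

With the defaults this comes out at roughly 924.6 seconds, so well over 5 minutes; lowering tcp_retries2 shortens it.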

Now I don't have any clue what's with the SystemError I'm getting, might
take a look if I find the time. Attached is the test script and the
command I use to simulate network outage.

And last but not least, txpostgres does not play that well with
client-side keepalives, because while the connection is idle, it's not
watching its file descriptor, so error conditions on that descriptor
will be detected only when you go and do a query. That is something I
might fix in the future.
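
To illustrate what watching the descriptor while idle would buy, here is a minimal sketch using plain sockets and select(). It is not txpostgres code: the function name is made up, and a real fix would register the connection's fileno() with the Twisted reactor rather than poll by hand.

```python
import select
import socket


def idle_connection_broken(sock, timeout=0.0):
    """Return True if an idle socket reports EOF or an error condition."""
    readable, _, errored = select.select([sock], [], [sock], timeout)
    if errored:
        return True
    if readable:
        try:
            # Peek so that real data (e.g. a notification) is not consumed.
            data = sock.recv(1, socket.MSG_PEEK)
        except OSError:
            # The kernel already marked the connection as failed,
            # for example after keepalive probes went unanswered.
            return True
        # An empty read means the peer closed the connection.
        return data == b""
    return False


if __name__ == "__main__":
    client, server = socket.socketpair()
    print(idle_connection_broken(client))  # peer still open
    server.close()
    print(idle_connection_broken(client))  # peer closed
    client.close()
```

The point is that an error found by the kernel while the connection is idle only surfaces once something actually looks at the socket.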

Gah, looking at all that TCP stuff always makes my head spin.

Cheers,
Jan

Attachment     Content-Type   Size
keepalive.py   text/x-python  215 bytes
break-network  text/plain     120 bytes
