Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

From: Jelte Fennema <Jelte(dot)Fennema(at)microsoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Zhihong Yu <zyu(at)yugabyte(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
Date: 2022-01-18 00:35:36
Message-ID: AM5PR83MB01780E7649EC5802643666A5F7589@AM5PR83MB0178.EURPRD83.prod.outlook.com
Lists: pgsql-hackers

It seems the man page description of TCP_USER_TIMEOUT does not align
with reality then. When I use it on my local machine it is effectively
applied as a connection timeout too. The second command below times out
after two seconds:

sudo iptables -A INPUT -p tcp --destination-port 5432 -j DROP
psql 'host=localhost tcp_user_timeout=2000'
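
For reference, this is roughly the socket-level call that the
tcp_user_timeout connection parameter maps to on Linux. This is a
sketch rather than libpq's actual code, and the helper name is mine:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set TCP_USER_TIMEOUT (in milliseconds) on an open TCP socket,
 * assuming Linux. After this, the kernel gives up on unacknowledged
 * transmitted data after timeout_ms instead of its defaults, which in
 * practice also bounds how long connection establishment can hang.
 * Returns 0 on success, -1 on failure. */
int set_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```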

The keepalive settings, however, only apply once you get to the recv.
And yes, it is pretty unlikely for the connection to break right when it
is waiting for data, but it has happened for us. When it happens it is
really bad: because recv is a blocking call, the process is blocked
forever.
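To make the keepalive side concrete, these are the Linux socket options
that the keepalives, keepalives_idle, keepalives_interval and
keepalives_count parameters correspond to. A sketch with a made-up
helper name, not libpq's actual code:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalives on a socket, assuming Linux. A dead peer is
 * then detected after roughly idle_s + count * interval_s seconds of
 * silence, but only while the connection is idle, e.g. while blocked
 * in recv waiting for a reply. Returns 0 on success, -1 on failure. */
int configure_keepalives(int fd, int idle_s, int interval_s, int count)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_s, sizeof(idle_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_s, sizeof(interval_s)) != 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
}
```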

After investigating an occurrence, it seemed to be a combination of a
few things:
1. The way Citus uses cancel requests: a Citus query on the coordinator
creates multiple connections to a worker and uses 2PC for distributed
transactions. If one connection receives an error, it sends a cancel
request for all the others.
2. When a machine is under heavy CPU or memory pressure, things don't
work well:
i. Errors occur pretty frequently, causing Citus to send lots of cancel
requests.
ii. The postmaster can be slow in handling new cancel requests.
iii. Our failover system can conclude the node is down, because health
checks are failing.
3. Our failover system effectively cuts the power and the network of the
primary when it triggers a failover to the secondary.

All of this together can result in a cancel request being interrupted at
exactly the wrong moment. When that happens, a distributed query on the
Citus coordinator becomes blocked forever. We've had queries stuck in
this state for multiple days. The only way out at that point is either
restarting postgres or manually closing the blocked socket (with ss or
gdb).
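As an illustration of why a bare blocking recv is the problem, here is
the usual defensive pattern: wait with poll() and a timeout before
reading, so a peer that silently vanished cannot block the process
forever. This is a generic sketch, not the actual fix proposed for
PQcancel:

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Like recv(), but gives up after timeout_ms milliseconds instead of
 * blocking indefinitely. Returns the number of bytes read, 0 on EOF,
 * or -1 on error or timeout. */
ssize_t recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)          /* 0 means timeout, -1 means poll error */
        return -1;
    return recv(fd, buf, len, 0);
}
```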

Jelte
