libpq copy error handling busted

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: libpq copy error handling busted
Date: 2020-06-03 20:12:42
Message-ID: 20200603201242.ofvm4jztpqytwfye@alap3.anarazel.de
Thread:
Lists: pgsql-hackers

Hi,

When libpq is used to COPY data to the server, it doesn't properly
handle errors.

An easy way to trigger the problem is to start pgbench -i with a
sufficiently large scale and then shut the server down. pgbench will
happily burn 100% CPU trying to send data to the server, even though
libpq already knows that the connection is broken.

It can't even be cancelled using ctrl-c anymore, because the cancel
request cannot be sent:

andres(at)awork3:~/src/postgresql$ pgbench -i -s 4000 -q
dropping old tables...
creating tables...
generating data (client-side)...
80889300 of 400000000 tuples (20%) done (elapsed 85.00 s, remaining 335.33 s)
^CCould not send cancel request: PQcancel() -- connect() failed: No such file or directory

This is partly an old problem, and it partly got worse recently. Before
the commit below we detected terminated connections, but we didn't
handle COPY failing.

The failure to handle terminated connections originates in:

commit 1f39a1c0641531e0462a4822f2dba904c5d4d699
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2019-03-19 16:20:20 -0400

Restructure libpq's handling of send failures.

The problem is basically that pqSendSome() returns *success* in all
failure cases. Both when a failure is already known:

+	/*
+	 * If we already had a write failure, we will never again try to send data
+	 * on that connection. Even if the kernel would let us, we've probably
+	 * lost message boundary sync with the server. conn->write_failed
+	 * therefore persists until the connection is reset, and we just discard
+	 * all data presented to be written.
+	 */
+	if (conn->write_failed)
+	{
+		/* conn->write_err_msg should be set up already */
+		conn->outCount = 0;
+		return 0;
+	}
+

and when initially "diagnosing" the failure:
			/* Anything except EAGAIN/EWOULDBLOCK/EINTR is trouble */
			switch (SOCK_ERRNO)
			...
				/* Discard queued data; no chance it'll ever be sent */
				conn->outCount = 0;
				return 0;

The idea of the above commit was:
Instead, let's fix this in a more general fashion by eliminating
pqHandleSendFailure altogether, and instead arranging to postpone
all reports of send failures in libpq until after we've made an
attempt to read and process server messages. The send failure won't
be reported at all if we find a server message or detect input EOF.

but that doesn't work here, because we never process the error message:
there is no code in pqParseInput3() to process server messages while a
COPY is in progress.

I'm honestly a bit baffled. How can we not have noticed that COPY FROM
STDIN doesn't handle errors before the input is exhausted? It's not just
pgbench; psql (and I assume pg_restore) etc. are affected as well.

Greetings,

Andres Freund
