OOM in libpq and infinite loop with getCopyStart()

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: OOM in libpq and infinite loop with getCopyStart()
Date: 2016-03-01 05:38:23
Message-ID: CAB7nPqSuM7MqC+viXPgy3vu+kc7ZC_HNSrSej+mvc8N7ca9WdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

One year and a half ago Heikki has reported that libpq could run on an
infinite loop in a couple of code paths should an OOM happen in
PQExec():
http://www.postgresql.org/message-id/547480DE.4040408@vmware.com
The code paths are:
- BIND messages for getRowDescriptions(), or 'T' messages, fixed per 7b96bf44
- Code paths handling some calls to PQmakeEmptyPGresult(), fixed per 414bef3
- Start message for COPY command, not fixed yet.
The two first problems have been addressed, not the last one. And as
the last thread became long and confusing here is a revival of the
topic on a new thread, with a fresh patch, and a fresh study of the
problem submitted for this CF because it would be nice to get things
fixed, or improved in some cases, but see below...

Here is a summary of what this patch does and improves in some cases:
1) COPY IN and COPY OUT, without the patch, those two remain stuck in
an infinite loop... With the patch an error "out of memory" is
reported to the caller.
I had a look as well at some community utilities, like pgloader and
pg_bulkload, though none of them are using getCopyStart(). Do other
people have ideas to things to look at? pgadmin for example?

Now in the case of COPY_BOTH. let's take a couple of examples:
2) pg_receivexlog:
In the case of pg_receivexlog, the patch allows libpq to throw
correctly an error back to the caller, then pg_receivexlog will try to
send START_REPLICATION again at its next loop:
pg_receivexlog: could not send replication command
"START_REPLICATION": out of memory
Without the patch pg_receivexlog would just ignore the error, and at
the next iteration of ParseInput() the message obtained from server
would be treated because the message cursor does not move on in this
case.
3) pg_recvlogical
Similarly pg_recvlogical fails with the following error with the patch attached:
pg_recvlogical: could not send replication command "START_REPLICATION
SLOT "toto" LOGICAL 0/0": out of memory
And after the failure it loops back to the next attempt, and is able
to perform its streaming stuff. Without the patch, similarly to
pg_receivexlog, COPY_BOTH would take the code path of ParseInput()
once again and it would treat the message, ignoring any OOM problems
on the way.
4) libpqwalreceiver: upon noticing the OOM error with the patch
attached, a standby would fail after sending START_REPLICATION, log
the OOM error properly and attempt to restart later a new WAL receiver
process to try again. Without this patch, the WAL receiver would still
fail to start, and log the following, unhelpful message:
2016-03-01 14:07:00 JST [84058]: [22-1] db=,user=,app=,client= LOG:
invalid record length at 0/3000970
That's not really informational for the user :( Note though that in
this case the WAL receiver does not remain in an infinite loop, and
that it is able to restart properly because it fails with this
"invalid record length error", and then start streaming at its next
attempt when the WAL receiver is started again.

So, that's how things are standing. I still see that it would be a
gain in terms of failure visibility to log properly this OOM failure
in all cases. Now it is true that if there are some applications that
may expect libpq to *not* return an error and treat the COPY start
message at the next loop of ParseInput(), though by looking at what is
in-core things can be handled properly.

Thoughts? I have registered that in the CF app, and a patch is attached.
--
Michael

Attachment Content-Type Size
0001-Fix-OOM-error-handling-in-COPY-protocol-of-libpq.patch text/x-diff 7.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-03-01 06:11:24 Re: remove wal_level archive
Previous Message Rushabh Lathia 2016-03-01 05:34:14 Logic problem in SerializeSnapshot()