Re: PQexec() hangs on OOM

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PQexec() hangs on OOM
Date: 2015-09-18 17:55:38
Message-ID: 55FC501A.5000409@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 09/14/2015 04:36 AM, Michael Paquier wrote:
> On Sat, Sep 12, 2015 at 6:11 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
>> wrote:
>>>
>>> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>>> wrote:
>>>
>>> Now, to move into the serious things...
>>>
>>> + /*
>>> + * Advance inStart to show that the copy related message has been
>>> + * processed.
>>> + */
>>> + conn->inStart = conn->inCursor;
>>> This change...
>>>
>>> + /* getCopyStart() moves
>>> inStart itself */
>>> conn->asyncStatus =
>>> PGASYNC_COPY_IN;
>>> - break;
>>> + continue;
>>> ... And this change are risky for a backpatch. And they actually
>>> break the existing replication protocol
>
> First, many thanks for providing your input! I am really looking
> forward into fixing those problems appropriately.
>
>> Can you please explain how will it break replication protocol?
>> I have done the required handling for Copy Both mode as well in attached
>> patch similar to what was done for other Copy modes in previous patch.
>> Check if you still find it as broken for replication?
>
> When not enough data has been received from the server, it seems that
> we should PQclear the existing result, and leave immediately
> pqParseInput3 with the status PGASYNC_BUSY to be able to wait for more
> data. Your patch, as-is, is breaking that promise (and this comes from
> the first versions of my patches). It seems also that we should not
> write an error message on connection in this case to be consistent in
> the behavior of back-branches. For the OOM case, we definitely need to
> take the error code path though, as your patch is correctly doing, and
> what mine legendary failed to do.
>
> + if (pqGetInt(&result->numAttributes, 2, conn))
> + {
> + errmsg = libpq_gettext("extraneous data in COPY start message");
> + goto advance_and_error;
> + }
> Here an error message is not needed, and this message should have been
> formulated as "insufficient data in blah" either way.
>
>> I have only kept the changes for COPY modes, so that once we settle on
>> those, I think similar changes could be done for getParamDescriptions()
>
> Yeah, I looked into this one as well, resulting in patch 0002
> attached. In this case we have the advantage to only receive data from
> the server, so pqPrepareAsyncResult is handling the failure just fine.
> Attached as well is the test case I used (similar to previous one,
> just extended a bit to report the error message), after getting the
> result from PQsendDescribePrepared, PQresultStatus is set correctly on
> error and reports "out of memory" to the user. What do you think about
> it?

Patches 0001 and 0002 look good to me. Two tiny nitpicks:

> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> PQstatus(streamConn) == CONNECTION_BAD)
> break;
> }

Is this really needed? AFAICS, the loop will terminate on
PGRES_FATAL_ERROR too. Without this patch, it will always return the
last result from the query, whether or not it was an error result. With
this patch, it will return the first error encountered, or last
non-error result if there is no error. In practice, there is no
difference because we don't expect multiple results from the server.

> @@ -1343,31 +1348,33 @@ getNotify(PGconn *conn)
> * parseInput already read the message type and length.
> */
> static int
> -getCopyStart(PGconn *conn, ExecStatusType copytype)
> +getCopyStart(PGconn *conn, ExecStatusType copytype, int msgLength)
> {
> PGresult *result;
> int nfields;
> int i;
> + const char *errmsg = NULL;
>
> result = PQmakeEmptyPGresult(conn, copytype);
> if (!result)
> - goto failure;
> + goto advance_and_error;
>
> if (pqGetc(&conn->copy_is_binary, conn))
> - goto failure;
> + goto not_enough_data;
> result->binary = conn->copy_is_binary;
> +
> /* the next two bytes are the number of fields */
> - if (pqGetInt(&(result->numAttributes), 2, conn))
> - goto failure;
> + if (pqGetInt(&result->numAttributes, 2, conn))
> + goto not_enough_data;
> nfields = result->numAttributes;
>
> /* allocate space for the attribute descriptors */
> - if (nfields > 0)
> + if (result && nfields > 0)

'result' is never NULL here, so the NULL test is still not needed.

The 0003 patch also looks OK to me as far as it goes. I feel that some
more refactoring would be in order, though. There are still many
different styles for handling a command: sometimes it's done inline in
the case-statement, sometimes it calls out to a separate function.
Sometimes the function moves inStart, sometimes it does not. On error
(e.g. short message), sometimes the function returns 'false', sometimes
it sets an error message itself.

Patches 0001 and 0002 look small enough that we probably could backpatch
them, so that would leave 0003 as a optional master-only cleanup.

I spent some time writing a regression test for this. This is pretty
difficult to test otherwise, and it's a good candidate for automated,
exchaustive testing. The only difficulty is injecting the malloc()
failures in a cross-platform way. I ended up with a glibc-only solution
that uses glibc's malloc() hooks. Other alternatives I considered were
Linux-specific LD_PRELOAD, GNU ld specific -wrap option. One
cross-platform solution would be to #define malloc() and friends to
wrapper functions, and compile a special version of libpq. Anyway, I
came up with the attached test program. It hangs without your fixes, and
works with the patches applied. Thoughts?

- Heikki

Attachment Content-Type Size
0001-Initial-version-of-libpq-mallocfail-tester.patch text/x-diff 13.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2015-09-18 22:17:46 Re: PQexec() hangs on OOM
Previous Message lacesco 2015-09-18 06:04:24 Fw: important message