Re: PQexec() hangs on OOM

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PQexec() hangs on OOM
Date: 2015-09-07 08:10:10
Message-ID: CAB7nPqQY=eXpgkNLaf+xQ0u3ZYAw6wHs72TfEVtiJQtT3HE7oA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>
> On Fri, Sep 4, 2015 at 12:55 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>>
>> On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>>>
>>>> It was actually a far cleaner approach to have a failure flag to decide if based on the async state process should move on to a failure code path or not.
>>>
>>>
>>> I think you have already tried, but it seems to me that if we can handle
>>> it based on result status, that would be better, rather than introducing
>>> new state, but anyway lets first try to sort out above reported problem.
>>
>>
>> So, looking at that again with a fresh eye, I noticed that getCopyDataMessage has a similar error handling when it fails on pqCheckInBufferSpace. So looking at what I added, it seems that I am trying to duplicate the protocol error handling even if everything is already present, so that's really overdoing it. Hence, let's simply drop this idea of new flag PGASYNC_FATAL and instead treat the OOM as a sync handling error with the server, like in the patch attached.
>>
>
> I think sync handling error, drops the connection which I feel should
> be only done as a last resort for any error and if we are able to handle
> OOM error for other kind of messages, then we should try to handle
> it for Copy related messages as well. I have tried to do it for Copy
> In and Copy Out protocol messages in the attached patch with logic
> similar to what is currently used in code. The idea is that in Copy
> mode if there is error we just return the error and the pending data
> will be automatically discarded in next execution during PQexecStart().

OK. So you rely on the error state set by pqSaveErrorResult, then wrap
an error check in getCopyResult and PQexecFinish... I guess that this
is acceptable for COPY as this code path would just kept looping
infinitely on the current HEAD.

>
>
>> When emulating an OOM with this patch, I am getting this error instead of the infinite loop, which looks acceptable to me:
>> =# copy aa to stdout;
>> out of memory
>> lost synchronization with server: got message type "H", length 5
>> The connection to the server was lost. Attempting reset: Succeeded.
>>
>> The extra message handling I have added at the end of getCopyStart and getParamDescriptions still looks more adapted to me when a failure happens, so I kept it.
>
>
> Sure, but then make the handling similar to getRowDescriptions() both for
> failure and success case, that way code will be consistent.

No problem for me, and this makes actually refactoring easier as those
code paths are more similar.

> Still I have not checked about COPY_BOTH and getParamDescriptions(),
> see if you like what I have done in attached patch, then we can do the
> similar for them as well.

Now, to move into the serious things...

+ /*
+ * Advance inStart to show that the copy related message has been
+ * processed.
+ */
+ conn->inStart = conn->inCursor;
This change...

+ /* getCopyStart() moves
inStart itself */
conn->asyncStatus = PGASYNC_COPY_IN;
- break;
+ continue;
... And this change are risky for a backpatch. And they actually
break the existing replication protocol and backward compatibility so
I have no doubt that they could break any client applications that
work directly with the replication level protocol with commands like
BASE_BACKUP, START_REPLICATION and IDENTITY_SYSTEM. I think that we
should really try to keep the non-error code path as close as possible
to the original code.

So, I think that the right approach would be to leave immediately
pqParseInput3 should an error happen, switching asyncStatus to leave
the loop in PQgetResult. This seems as well to work correctly with
PGRES_COPY_BOTH (I emulated failures with pg_basebackup and errors
were caught up correctly. This brings back of course the introduction
of a new flag PGASYNC_FATAL and we could replace the extra stuff you
added in getCopyResult as well. I have a WIP patch that I don't have
the time to finish now, but I'll send it once I am done. For now I am
just sharing my thoughts on the matter.
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2015-09-07 09:45:04 Re: PQexec() hangs on OOM
Previous Message Amit Kapila 2015-09-05 12:45:24 Re: PQexec() hangs on OOM