From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: PQexec() hangs on OOM |
Date: | 2015-10-10 12:06:36 |
Message-ID: | CAA4eK1LyEcuX0QzpmQY5JtQkH8u6gCy8kFskn0+GOOgz5y6W2A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On Sat, Sep 19, 2015 at 10:44 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com
> wrote:
>
>
> On Fri, Sep 18, 2015 at 11:32 PM, Amit Kapila wrote:
> > IIRC, this is required to sanely report "out of memory" error in case
> > of replication protocol (master-standby setup). This loop and
in-particular
> > this check is quite similar to PQexecFinish() functions check and loop
> > where we return last result. I think it is better to keep both the
places
> > in-sync
> > and also I think this is required to report the error appropriately. I
have
> > tried manual debugging for the out of memory error for this case and
> > it works well with this check and without the check it doesn't report
> > the error in an appropriate way(don't remember exactly what was
> > the difference). If required, I can again try to reproduce the scenario
> > and share the exact report.
>
> I just had a look at that again... Put for example a call to pg_usleep in
libpqrcv_identify_system after executing IDENTIFY_SYSTEM and before calling
PQresultStatus, then take a breakpoint on the WAL receiver process when
starting up a standby. This gives plenty of room to emulate the OOM failure
in getCopyStart. When the check on PGRES_FATAL_ERROR is not added and when
emulating the OOM immediately, libpqrcv_PQexec loops twice and thinks it
can start up strrep but fails afterwards. Here is the failure seen from the
standby:
> LOG: started streaming WAL from primary at 0/3000000 on timeline 1
> FATAL: could not send data to WAL stream: server closed the connection
unexpectedly
> And from the master:
> LOG: unexpected EOF on standby connection
> The WAL receiver process ultimately restarts after.
>
> When the check on PGRES_FATAL_ERROR is added, strrep fails to start
appropriately and the error is fetched correctly by the WAL receiver:
> FATAL: could not start WAL streaming: out of memory
>
> In short: Amit seems right to have added this check.
>
Okay, in that case, you can revert that change in your first patch and
then that patch will be Ready for committer.
In second patch [1], the handling is to continue on error as below:
- if (getParamDescriptions(conn))
+ if (getParamDescriptions(conn, msgLength))
return;
- break;
+ /* getParamDescriptions() moves inStart itself */
+ continue;
Now, assume there is "out of memory" error in getParamDescription(),
the next message is 'T' which leads to a call to getRowDescriptions()
and in that function if there is an error in below part of code, then I
think it will just overwrite the previous "out of memory" error (I have
tried by manual debugging and forcefully generating the below error):
getRowDescriptions()
{
..
if (pqGetInt(&(result->numAttributes), 2, conn))
{
/* We should not run out of data here, so complain */
errmsg = libpq_gettext("insufficient data in \"T\" message");
goto advance_and_error;
}
..
}
Here, the point to note is that for similar handling of error from
getRowDescriptions(), we just skip the consecutive 'D' messages
by checking resultStatus.
I think overwriting of error for this case is not the right behaviour.
[1] - 0002-Fix-OOM-error-handling-in-BIND-protocol-of-libpq-2.patch
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2015-10-11 13:01:38 | Re: PQexec() hangs on OOM |
Previous Message | Noah Misch | 2015-10-10 02:16:51 | Re: Re: [BUGS] BUG #13611: test_postmaster_connection failed (Windows, listen_addresses = '0.0.0.0' or '::') |