Re: OOM in libpq and infinite loop with getCopyStart()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: OOM in libpq and infinite loop with getCopyStart()
Date: 2016-04-01 15:30:15
Message-ID: 24249.1459524615@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> So the core of my complaint is that we need to fix things so that, whether
> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
> better consider the behavior when we cannot), ...

BTW, the real Achilles' heel of any attempt to ensure sane behavior at
the OOM limit is this possibility of being unable to create a PGresult
with which to inform the client that we failed.

I wonder if we could make things better by keeping around an emergency
backup PGresult struct. Something along these lines:

1. Add a field "PGresult *emergency_result" to PGconn.

2. At the very beginning of any PGresult-returning libpq function, check
to see if we have an emergency_result, and if not make one, ensuring
there's room in it for a reasonable-size error message; or maybe even
preload it with "out of memory" if we assume that's the only condition
it'll ever be used for. If malloc fails at this point, just return NULL
without doing anything or changing any libpq state. (Since a NULL result
is documented as possibly caused by OOM, this isn't violating any API.)

3. Subsequent operations never touch the emergency_result unless we're
up against an OOM, but it can be used to return a failure indication
to the client so long as we leave libpq in a state where additional
calls to PQgetResult would return NULL.

Basically this shifts the point where an unreportable OOM could happen
from somewhere in the depths of libpq to the very start of an operation,
where we're presumably in a clean state and OOM failure doesn't leave
us with a mess we can't clean up.
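
A minimal sketch of the three steps above, for illustration only. Everything here is hypothetical: DemoConn, DemoResult, the demo_* functions, and the injectable allocator are stand-ins invented for this example, not the real PGconn/PGresult structs or any existing libpq code. The spare result is preloaded with "out of memory", which is one of the two options mentioned in step 2.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the real libpq types. */
typedef enum { DEMO_OK, DEMO_FATAL_ERROR } DemoStatus;

typedef struct DemoResult
{
    DemoStatus  status;
    char        errmsg[64];
} DemoResult;

typedef struct DemoConn
{
    DemoResult *emergency_result;   /* step 1: the spare result */
    int         results_pending;    /* nonzero while more results are due */
    void     *(*alloc) (size_t);    /* injectable so OOM can be simulated */
} DemoConn;

/* Test allocator (assumption for the demo): fails once the budget is spent. */
static int  demo_allocs_left = 0;

static void *
demo_limited_alloc(size_t size)
{
    if (demo_allocs_left <= 0)
        return NULL;
    demo_allocs_left--;
    return malloc(size);
}

/* Step 2: make sure the spare exists before touching any other state. */
static int
demo_reserve_emergency(DemoConn *conn)
{
    DemoResult *res;

    if (conn->emergency_result != NULL)
        return 1;
    res = conn->alloc(sizeof(DemoResult));
    if (res == NULL)
        return 0;               /* caller returns NULL, state untouched */
    res->status = DEMO_FATAL_ERROR;
    strcpy(res->errmsg, "out of memory");
    conn->emergency_result = res;
    return 1;
}

/* Step 3: hand out the spare and leave the connection with nothing pending. */
static DemoResult *
demo_take_emergency(DemoConn *conn)
{
    DemoResult *res = conn->emergency_result;

    assert(res != NULL);        /* guaranteed by demo_reserve_emergency */
    conn->emergency_result = NULL;
    conn->results_pending = 0;
    return res;
}

/* A result-returning entry point built on the two helpers above. */
DemoResult *
demo_exec(DemoConn *conn)
{
    DemoResult *res;

    if (!demo_reserve_emergency(conn))
        return NULL;            /* OOM at the very start, clean state */

    conn->results_pending = 1;  /* the operation is now "in the depths" */
    res = conn->alloc(sizeof(DemoResult));
    if (res == NULL)
        return demo_take_emergency(conn);   /* OOM is still reportable */

    res->status = DEMO_OK;
    res->errmsg[0] = '\0';
    conn->results_pending = 0;
    return res;
}
```

With an allocation budget of zero, demo_exec() returns NULL and changes nothing. With a budget of one, the spare gets allocated, the real allocation fails, and the caller receives the preloaded error result while results_pending is back to zero. The spare is consumed in that case, so the next call has to recreate it at its own entry point.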

regards, tom lane
