Re: Cryptic error message in low-memory conditions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cryptic error message in low-memory conditions
Date: 2011-08-26 23:59:35
Message-ID: 8040.1314403175@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Daniel Farina <daniel(at)heroku(dot)com> writes:
> Some Postgres servers will error out for a while with the following
> error message:
> "expected authentication request from server, but received c"
> [ and this seems to be triggered by fork failures in the server ]

<spock>Fascinating.</spock>

I poked at this for awhile by the expedient of hot-wiring postmaster.c
to always fail the fork request:

*** src/backend/postmaster/postmaster.c.orig Tue Aug 23 17:13:09 2011
--- src/backend/postmaster/postmaster.c Fri Aug 26 19:08:25 2011
***************
*** 3182,3188 ****
#ifdef EXEC_BACKEND
pid = backend_forkexec(port);
#else /* !EXEC_BACKEND */
! pid = fork_process();
if (pid == 0) /* child */
{
free(bn);
--- 3182,3189 ----
#ifdef EXEC_BACKEND
pid = backend_forkexec(port);
#else /* !EXEC_BACKEND */
! pid = -1; // fork_process();
! errno = ENOMEM;
if (pid == 0) /* child */
{
free(bn);

The basic case did what I was expecting:

$ psql "dbname=regression"
psql: could not fork new process for connection: Cannot allocate memory

which was unsurprising since I'm quite sure that code path got tested
in basically this fashion, back in the dark ages. But sure enough,
over a TCP connection with SSL support turned on, it fails as described.
strace shows the postmaster is sending what it's supposed to.
I eventually figured out the problem: libpq sees the "E" response,
thinks it must be talking to a postmaster too old to understand
NEGOTIATE_SSL_CODE, and closes the socket and tries again. But *it
forgets to clear out its input buffer*, so the body of the error message
is still there waiting to be read, and the next connection sees the "c"
as the first byte of the first postmaster response. The fix for this is

*** src/interfaces/libpq/fe-connect.c.orig Thu Jul 28 10:39:57 2011
--- src/interfaces/libpq/fe-connect.c Fri Aug 26 18:52:18 2011
***************
*** 2064,2069 ****
--- 2064,2072 ----
conn->allow_ssl_try = false;
/* Assume it ain't gonna handle protocol 3, either */
conn->pversion = PG_PROTOCOL(2, 0);
+ /* Discard any unread/unsent data */
+ conn->inStart = conn->inCursor = conn->inEnd = 0;
+ conn->outCount = 0;
/* Must drop the old connection */
closesocket(conn->sock);
conn->sock = -1;

and similarly in the other places in fe-connect.c where we're abandoning
an open connection (not sure every one of those has to have it, but it
seems a good safety practice).

That gets us to the right place in the default SSL mode:

$ psql "host=localhost sslmode=prefer dbname=regression"
psql: could not fork new process for connection: Cannot allocate memory

but we're still not there if you're trying to force SSL:

$ psql "host=localhost sslmode=require dbname=regression"
psql: server does not support SSL, but SSL was required

The reason for this is that that same bit of code supposes that any
"E" response must mean that the postmaster didn't recognize
NEGOTIATE_SSL_CODE. It doesn't (and of course shouldn't) pay any
attention to the actual textual error message.

Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
since PG 7.0, I believe that this is dead code and we could remove it;
it seems exceedingly unlikely that any modern build of libpq will ever
be used to talk to a server that responds to that with "E".

In fact it's worse than just delivering a misleading error message in
the "require" case, because if you're not doing "require" then what
happens next is that the code forces the protocol level down to 2 and
tries again, supposing that it must be dealing with an ancient server.
In the normal case where you are talking to a server under load rather
than a deliberately-broken one, it's entirely possible that the second
connection attempt succeeds. And what you've got then is that the
connection is operating in protocol 2 rather than what the user probably
expected, disabling assorted functionality that he may well be depending
on. I don't recall having seen reports that could match that syndrome,
but then again Daniel's complaint is a new one on me too.

So I'm thinking we'd be well advised to eliminate the assumption that an
"E" response could be fixed by downgrading to protocol 2, and instead
just make this code report the error message it got from the postmaster.
That's more than a one-liner so I don't have a patch for it yet.

Lastly, I noticed that if I tried this repeatedly on a Unix socket,
I sometimes got

psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
could not send startup packet: Broken pipe

rather than the expected results. I think what is happening here is a
race condition, such that if the postmaster closes the socket without
having read the startup packet, the client might not have actually
gotten to its send() yet, and then it will get EPIPE from send() before
it gets to the point of reading the error response. I tried to fix this
by having report_fork_failure_to_client eat any pending data before
responding:

*** src/backend/postmaster/postmaster.c.orig Tue Aug 23 17:13:09 2011
--- src/backend/postmaster/postmaster.c Fri Aug 26 19:08:25 2011
***************
*** 3262,3276 ****
char buffer[1000];
int rc;

/* Format the error message packet (always V2 protocol) */
snprintf(buffer, sizeof(buffer), "E%s%s\n",
_("could not fork new process for connection: "),
strerror(errnum));

- /* Set port to non-blocking. Don't do send() if this fails */
- if (!pg_set_noblock(port->sock))
- return;
-
/* We'll retry after EINTR, but ignore all other failures */
do
{
--- 3262,3291 ----
char buffer[1000];
int rc;

+ /* Set port to non-blocking. Don't do anything if this fails */
+ if (!pg_set_noblock(port->sock))
+ return;
+
+ /*
+ * Try to read the startup packet, if it's available which it likely is.
+ * The reason for doing this is that when the client is connected via Unix
+ * socket, if we close the socket without having read data it already
+ * sent, many kernels will report EPIPE instead of letting the client
+ * read whatever we sent. That will likely result in an unhelpful
+ * "server closed the connection unexpectedly" report instead of what
+ * we are trying to tell them. So read and discard any already-available
+ * data. We already set nonblock mode, so we won't block here.
+ */
+ do
+ {
+ rc = recv(port->sock, buffer, sizeof(buffer), 0);
+ } while (rc > 0 || (rc < 0 && errno == EINTR));
+
/* Format the error message packet (always V2 protocol) */
snprintf(buffer, sizeof(buffer), "E%s%s\n",
_("could not fork new process for connection: "),
strerror(errnum));

/* We'll retry after EINTR, but ignore all other failures */
do
{

This seems to get rid of the problem on my Fedora box, but my old HPUX
box still does it some of the time. I'm inclined to recommend applying
this patch anyway --- the only alternative I can see is to try to
rejigger the libpq code to temporarily ignore EPIPE during startup,
which seems both messy and fragile.

Comments?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2011-08-27 00:41:23 Re: Removal of useless include references
Previous Message Alvaro Herrera 2011-08-26 22:28:56 Re: Removal of useless include references