Re: libpq should append auth failures, not overwrite

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: libpq should append auth failures, not overwrite
Date: 2018-08-10 16:36:00
Message-ID: 25918.1533918960@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> On Thu, Aug 09, 2018 at 11:44:27AM -0400, Tom Lane wrote:
>> I noticed that, although most error reports during libpq's connection
>> setup code append to conn->errorMessage, the ones in fe-auth.c and
>> fe-auth-scram.c don't: they're all printfPQExpBuffer() not
>> appendPQExpBuffer(). This seems wrong to me. It makes no difference
>> in simple cases with a single target server, but as soon as you have
>> multiple servers listed in "host", this coding makes it impossible
>> to tell which server rejected your login.

> +1.

So I thought this would be a trivial patch barely even worthy of posting,
but an awful lot of critters crawled out from under that rock. It's not
just fe-auth*.c at fault; low-level failures such as timeout errors were
also clearing conn->errorMessage, and if you got an actual server error
(like "role does not exist"), that did too. What I ended up doing was a
wholesale conversion of libpq's management of conn->errorMessage. Now,
there are a few identified points at connection start or query start that
explicitly clear errorMessage, and everyplace else in libpq is supposed to
append to it, never just printf onto it. (Interestingly, all of those
"identified points" already did clear errorMessage, meaning that a large
fraction of the existing printf's would always have started with an empty
errorMessage anyway.)

The first patch attached does that, and then the second one is a proposed
modification in the way we report failures for multi-host connection
attempts: let's explicitly label each section of conn->errorMessage with
the host we're trying to connect to. This can replace the ad-hoc labeling
that somebody thought would be a good idea for two errors associated with
the target_session_attrs=read-write feature. (That's not a bad idea in
itself, but doing it only for those two errors is.)

Here's some examples of the kind of connection failure reports the code
can produce with these patches:

$ psql "host=refusenik,dead,localhost user=readonly dbname=postgres connect_timeout=3 target_session_attrs=read-write"
psql: server "refusenik" port 5432:
could not connect to server: Connection refused
Is the server running on host "refusenik" (192.168.1.43) and accepting
TCP/IP connections on port 5432?
server "dead" port 5432:
timeout expired
server "localhost" port 5432:
could not open a read-write session

$ psql "host=refusenik,dead,localhost user=nobody dbname=postgres"
psql: server "refusenik" port 5432:
could not connect to server: Connection refused
Is the server running on host "refusenik" (192.168.1.43) and accepting
TCP/IP connections on port 5432?
server "dead" port 5432:
timeout expired
server "localhost" port 5432:
FATAL: role "nobody" does not exist

Before, you'd get a rather random subset of these messages depending on
which parts of the code decided to clear conn->errorMessage, and in
many cases it'd be really hard to tell which server was involved with
which message(s).

Some points for discussion and review:

1. The bulk of patch 0001 is indeed just
s/printfPQExpBuffer/appendPQExpBuffer/g, though I also made an attempt
to use appendPQExpBufferStr wherever possible. There are two categories
of printfPQExpBuffer calls I didn't change:

1a. Calls reporting an out-of-memory situation. There was already a
policy in some places that we'd intentionally printf not append such
messages, and I made that explicit. The idea is that we might not
have room to add more text to errorMessage, so resetting the buffer
provides more certainty of success. However, this makes for a pretty
weird inconsistency in the code; there are a lot of functions now in
which half of the messages are printf'd and half are append'd, so I'm
afraid that future patches will get it wrong as to which to use. Also,
in reality there often *would* be room to append "out of memory" without
enlarging errorMessage's buffer, so that this could just be pointless
destruction of useful error history. I didn't do anything about it here,
but I'm tempted to replace all the printf's for OOM messages with a
subroutine that will append if it can do so without enlarging the existing
buffer, else replace.

1b. There are a few places where it didn't seem worth changing anything
because it'd just result in needing to add a resetPQExpBuffer in the same
function or very nearby. Mostly in fe-lobj.c.

2. I had to drop some code (notably in pqPrepareAsyncResult) that
attempted to force conn->errorMessage to always have the same contents
as the current PGresult's PQresultErrorMessage; as it stood, server
errors such as "role does not exist" wiped out preceding contents of
errorMessage, breaking cases such as the second example above. This is
slightly annoying, but in the new dispensation it's possible that
conn->errorMessage contains a mix of libpq-generated errors and text
received from the server, so I'm not sure that preserving the old
equivalence is a useful goal anyway. We shouldn't cram pre-existing
errorMessage text back into a server-generated error; that would
invalidate the server's auxiliary error info such as the SQLSTATE.
I don't know if it's possible to do better here. One idea is to tweak
pqPrepareAsyncResult so that it does the message synchronization only
when in CONNECTION_OK state, allowing conn->errorMessage to be a
historical record only for connection processing. But that seems like
a hack.

3. In some places I failed to resist the temptation to copy-edit error
messages that didn't meet style guidelines, or to add libpq_gettext()
annotation where it seemed to have been missed.

4. connectDBComplete() cleared errorMessage after a successful
completion, but that's the wrong place to do it; we have to do that
at the success exits from PQconnectPoll, else applications that
use PQconnectPoll directly will see leftover garbage there.

5. There were places in PQconnectPoll that temporarily set
conn->status = CONNECTION_OK to prevent PQsendQueryStart from
complaining. A better idea is to leave the status alone and change
that test in PQsendQueryStart to be status != CONNECTION_BAD. This
not only simplifies PQconnectPoll, but allows PQsendQueryStart
to tell whether we're still in the connection sequence. I fixed
it to not clear errorMessage when we are, thus solving the problem
of PQsendQuery destroying prior connection error history. That allows
getting rid of saveErrorMessage/restoreErrorMessage, which were
(a) an incredibly ugly and inefficient hack, and (b) inadequate,
since there were other paths through PQconnectPoll that could reach
PQsendQueryStart but were not protected with calls to those.

6. In the 0002 patch, I have it printing the annotation only when
there's more than one connhost[], and only once per connhost not
once per IP address. The motivation for doing it like that was to
not change the behavior for simple cases with only one host name.
There is certainly room to argue that we should make an annotation
once per IP address, but that'd change the behavior for pretty common
cases like "localhost" resolving to both 127.0.0.1 and ::1, so I don't
know if it'd be a good idea or not. There's also an interaction to
consider with my patch at
https://www.postgresql.org/message-id/4913.1533827102%40sss.pgh.pa.us
namely that we have to be sure the labeling behaves sanely for
connhosts that fail to resolve any IP addresses.

I'll put this into the September commitfest.

regards, tom lane

Attachment Content-Type Size
0001-libpq-append-errors-1.patch text/x-diff 87.4 KB
0002-libpq-better-server-id-1.patch text/x-diff 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-08-10 16:46:53 Re: xact_start meaning when dealing with procedures?
Previous Message Michael Paquier 2018-08-10 16:30:21 Re: Improve behavior of concurrent TRUNCATE