Re: Multiple hosts in connection string failed to failover in non-hot standby mode

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hubert Zhang <zhubert(at)vmware(dot)com>
Cc: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multiple hosts in connection string failed to failover in non-hot standby mode
Date: 2021-01-10 22:38:50
Message-ID: 770711.1610318330@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> The problems that I see in this area are first that there's no
> real standardization in libpq as to whether to append error messages
> together or just flush preceding messages; and second that no effort
> is made in multi-connection-attempt cases to separate the errors from
> different attempts, much less identify which error goes with which
> host or IP address. I think we really need to put some work into
> that.

I spent some time on this, and here is a patch set that tries to
improve matters.

0001 changes the libpq coding rules to be that all error messages should
be appended to conn->errorMessage, never overwritten (there are a couple
of exceptions in fe-lobj.c) and we reset conn->errorMessage to empty only
at the start of a connection request or new query. This is something
that's been bugging me for a long time and I'm glad to get it cleaned up.
Formerly it seemed that a dartboard had been used to decide whether to use
"printfPQExpBuffer" or "appendPQExpBuffer"; now it's always the latter.
We can also get rid of several hacks that were used to get around the
mess and force appending behavior.

0002 then changes the reporting rules in fe-connect.c as I suggested,
so that you might get errors like this:

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to host "localhost" (::1), port 12345: Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory
Is the server running locally and accepting connections on that socket?

and we have a pretty uniform rule that errors coming back from a
connection attempt will be prefixed with the server address.

Then 0003 is the part of your patch that I'm happy with. Given 0001+0002
we could certainly consider looping back and retrying for more cases, but
I still want to tread lightly on that. I don't see a lot of value in
retrying errors that seem to be on the client side, such as failure to
set socket properties; and in general I'm hesitant to add untestable
code paths here.

I feel pretty good about 0001: it might be committable as-is. 0002 is
probably subject to bikeshedding, plus it has a problem in the ECPG tests.
Two of the error messages are now unstable because they expose
chosen-at-random socket paths:

diff -U3 /home/postgres/pgsql/src/interfaces/ecpg/test/expected/connect-test5.stderr /home/postgres/pgsql/src/interfaces/ecpg/test/results/connect-test5.stderr
--- /home/postgres/pgsql/src/interfaces/ecpg/test/expected/connect-test5.stderr 2020-08-04 14:59:45.617380050 -0400
+++ /home/postgres/pgsql/src/interfaces/ecpg/test/results/connect-test5.stderr 2021-01-10 16:12:22.539433702 -0500
@@ -36,7 +36,7 @@
[NO_PID]: sqlca: code: 0, state: 00000
[NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port <DEFAULT> for user regress_ecpg_user2
[NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL: database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: could not connect to socket "/tmp/pg_regress-EbHubF/.s.PGSQL.58080": FATAL: database "regress_ecpg_user2" does not exist

[NO_PID]: sqlca: code: 0, state: 00000
[NO_PID]: ecpg_finish: connection main closed
@@ -73,7 +73,7 @@
[NO_PID]: sqlca: code: -220, state: 08003
[NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port <DEFAULT> for user regress_ecpg_user2
[NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL: database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: could not connect to socket "/tmp/pg_regress-EbHubF/.s.PGSQL.58080": FATAL: database "regress_ecpg_user2" does not exist

[NO_PID]: sqlca: code: 0, state: 00000
[NO_PID]: ecpg_finish: connection main closed

I don't have any non-hacky ideas what to do about that. The extra detail
seems useful to end users, but we don't have any infrastructure that
would allow filtering it out in the ECPG tests.

regards, tom lane

Attachment Content-Type Size
0001-append-all-error-messages.patch text/x-diff 127.2 KB
0002-identify-host-more-consistently-wip.patch text/x-diff 8.8 KB
0003-retry-if-cannot-connect-now.patch text/x-diff 964 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2021-01-10 23:23:41 Re: Inconsistent "<acronym>" use
Previous Message Tomas Vondra 2021-01-10 22:35:03 Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits