Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

From:	"Tels" <nospam-pg-abuse(at)bloodgate(dot)com>
To:	"Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc:	"'Michael Paquier'" <michael(dot)paquier(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stephen Frost" <sfrost(at)snowman(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur
Date:	2017-05-19 11:59:44
Message-ID:	e97450e4ecfad32cae2da2858b01c225.squirrel@sm.webmail.pair.com
Lists:	pgsql-hackers

On Thu, May 18, 2017 10:24 pm, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner(at)postgresql(dot)org
>> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Michael
>> Paquier
>> On Thu, May 18, 2017 at 11:30 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>> > Because why?
>>
>> Because it is critical to let the user know as well *why* an error
>> happened. Imagine that this feature is used with multiple nodes, all
>> primaries. If a DB admin busted the credentials in one of them then all
>> the load would be redirected on the other nodes, without knowing what is
>> actually causing the error. Then the node where the credentials have been
>> changed would just run idle, and the application would be unaware of that.
>
> In that case, the DBA can know the authentication errors in the server log
> of the idle instance.
>
> I'm sorry to repeat myself, but libpq connection failover is the feature
> for HA. So I believe what to prioritize is HA.

I'm in agreement here. The feature sounds very useful to me for HA, but
HA means it needs to work as reliably as possible, not just "often enough"
:)

If one goes to the length of running multiple instances, there is surely
also monitoring in place to see if they are healthy or under load/stress.

The beauty of having libpq try multiple hosts until one works is that you
can handle temporary unavailability (e.g. one instance is restarted for
patching) and general failure (one instance goes down due to whatever
error) in one place, without having to implement this logic in every app
(database user connector).
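To make "in one place" concrete, here is a minimal sketch of the connection string an app would hand to libpq (PG10 or later). The helper name build_conninfo and the database name "appdb" are made up for illustration; the comma-separated host list is the libpq feature under discussion.

```python
def build_conninfo(hosts, dbname, port=5432):
    """Return a libpq keyword/value connection string listing several hosts.

    build_conninfo is a hypothetical helper, not part of libpq or any driver.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # libpq (PostgreSQL 10+) accepts a comma-separated host list and tries
    # the entries in the order given; a single port applies to all hosts.
    return f"host={','.join(hosts)} port={port} dbname={dbname}"


conninfo = build_conninfo(["A", "B", "C"], "appdb")
print(conninfo)  # host=A,B,C port=5432 dbname=appdb
```

The resulting string can be passed unchanged to any libpq-based driver, so the failover logic stays in libpq and not in each app.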

Imagine, for instance, that you have 3 hosts A, B and C.

There are two scenarios here: Everyone tries "A,B,C", or everyone tries
random permutations like "A,C,B" or "B,C,A".

In the first scenario, almost all connections would go to A, until it no
longer accepts connections, then they spill over to B.

In the second one, each host gets 1/3 of all connections equally.

Now imagine that B is down, either for a brief period or permanently.

If libpq stops when it fails to connect to B, then the scenarios play out
like this:

1: Almost all connections run on A, but a random subset breaks when
spillover to B does not happen. Host C is idle.

2: 2/3 of all connections just work, 1/3 just breaks. Both A and C have a
higher load than usual.

If libpq skips B and continues, then we have instead:

1: Almost all connections run on A, but a random subset spills over to C
after skipping B.

2: All connections run on A or C, B is always skipped if it appears before
A or C.
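The two behaviours can be compared with a small simulation. This is only a sketch of the reasoning above, not libpq code: the functions connect and simulate are invented for illustration, and a host that is down (or full) is modelled simply as a member of the "down" set.

```python
import random


def connect(order, down, skip_failed):
    """Return the host that accepts the connection, or None on failure.

    order:       hosts in the order the client tries them
    down:        set of hosts that currently refuse connections
    skip_failed: True  = move on to the next host after a failure
                 False = give up at the first failure
    """
    for host in order:
        if host not in down:
            return host
        if not skip_failed:
            return None
    return None


def simulate(n, hosts, down, skip_failed, seed=0):
    """Scenario 2: every client tries a random permutation of the hosts."""
    rng = random.Random(seed)
    counts = {h: 0 for h in hosts}
    counts["failed"] = 0
    for _ in range(n):
        order = hosts[:]
        rng.shuffle(order)
        result = connect(order, down, skip_failed)
        counts[result if result is not None else "failed"] += 1
    return counts


hosts = ["A", "B", "C"]
print("stop at B:", simulate(3000, hosts, {"B"}, skip_failed=False))
print("skip B:   ", simulate(3000, hosts, {"B"}, skip_failed=True))
```

With stop-on-failure, roughly the third of clients whose permutation starts with B fail; with skipping, none fail and the load is split between A and C. Scenario 1 can be approximated by treating a full A as refusing connections: connect(["A", "B", "C"], {"A", "B"}, False) fails, while the same call with skip_failed=True lands on C.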

The admin would see on the monitoring that B is offline (briefly or
permanently) and would need to correct it.

From the user's perspective, the second variant is smooth, while the first
breaks randomly. A "database user" does not really want to know that B is
down or why; they just expect to get a working DB connection.

That's my 0.02 € anyway.

Tels

In response to

Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur at 2017-05-19 02:24:34 from Tsunakawa, Takayuki
