Re: [EXTERNAL] Re: Support load balancing in libpq

From: Michael Banck <mbanck(at)gmx(dot)net>
To: Jelte Fennema <Jelte(dot)Fennema(at)microsoft(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [EXTERNAL] Re: Support load balancing in libpq
Date: 2022-09-17 16:57:39
Message-ID: 6325fc84.050a0220.ab071.038d@mx.google.com
Lists: pgsql-hackers

Hi,

On Mon, Sep 12, 2022 at 02:16:56PM +0000, Jelte Fennema wrote:
> Attached is an updated patch with the following changes:
> 1. rebased (including solved merge conflict)
> 2. fixed failing tests in CI
> 3. changed the commit message a little bit
> 4. addressed the two remarks from Michael
> 5. changed the prng_state from a global to a connection level value for thread-safety
> 6. use pg_prng_uint64_range

Thanks!

I tested this some more, and found it somewhat surprising that at least
when looking at it on a microscopic level, some hosts are chosen more
often than the others for a while.

I basically ran

  while true; do psql -At "host=pg1,pg2,pg3 load_balance_hosts=1" \
    -c "SELECT inet_server_addr()"; sleep 1; done

and the initial output was:

10.0.3.109
10.0.3.109
10.0.3.240
10.0.3.109
10.0.3.109
10.0.3.240
10.0.3.109
10.0.3.240
10.0.3.240
10.0.3.240
10.0.3.240
10.0.3.109
10.0.3.240
10.0.3.109
10.0.3.109
10.0.3.240
10.0.3.240
10.0.3.109
10.0.3.60

I.e. the second host (pg2/10.0.3.60) was not hit until the 19th iteration.
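
For a rough sense of how unusual that is: assuming (this is not verified
against the patch) that every psql invocation picks one of the three
hosts independently and uniformly, the chance that one particular host
is missed in the first 18 picks is (2/3)^18. The helper name below is
made up for illustration.

```python
def miss_probability(n_hosts: int, n_picks: int) -> float:
    """Probability that one particular host is never chosen in n_picks
    independent, uniform picks among n_hosts hosts (an assumed model)."""
    return ((n_hosts - 1) / n_hosts) ** n_picks


# Three hosts, and the third one only shows up on the 19th pick,
# i.e. it was missed 18 times in a row.
print(f"{miss_probability(3, 18):.5f}")
```

This only covers one pre-selected host; the chance that any of the
three hosts is missing for that long is somewhat higher.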

Once significantly more than a hundred iterations are run, the hosts
even out somewhat, but this might be surprising to users:

iterations      50   100   250   500  1000  10000
10.0.3.60        9    24    77   165   328   3317
10.0.3.109      25    42    88   178   353   3372
10.0.3.240      16    34    85   157   319   3311

Or maybe my test setup is skewed? When I use a two-second sleep between
psql calls, I get a more even distribution initially, but it then
diverges after 100 iterations:

iterations      50   100   250   500  1000
10.0.3.60       19    36    98   199   374
10.0.3.109      13    33    80   150   285
10.0.3.240      18    31    72   151   341

Could just be bad luck...
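
One way to put a number on "bad luck" is a chi-square goodness-of-fit
test against a uniform distribution. The sketch below is only a sanity
check under the assumption that the picks are independent; the function
name is hypothetical, and the closed-form p-value exp(-x/2) holds only
for exactly three categories (two degrees of freedom). The cumulative
checkpoints in the tables are not independent of each other, so each
column should be read on its own.

```python
import math


def chi_square_uniform(counts):
    """Return (statistic, p_value) for a uniform goodness-of-fit test.

    Only valid for three categories: with two degrees of freedom the
    chi-square survival function is exactly exp(-x / 2).
    """
    if len(counts) != 3:
        raise ValueError("closed-form p-value assumes exactly 3 categories")
    expected = sum(counts) / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, math.exp(-stat / 2)


# Final columns of the two tables above (one-second and two-second sleep).
for label, counts in [("1s sleep, 10000 runs", [3317, 3372, 3311]),
                      ("2s sleep, 1000 runs", [374, 285, 341])]:
    stat, p = chi_square_uniform(counts)
    print(f"{label}: chi2={stat:.2f} p={p:.4f}")
```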

I also switched one host (pg1) to have two IP addresses in /etc/hosts:

10.0.3.109 pg1
10.0.3.60 pg1
10.0.3.240 pg3

This resulted in the following (one-second sleep again):

First run:

iterations      50   100   250   500  1000
10.0.3.60       10    18    56   120   255
10.0.3.109      14    30    67   139   278
10.0.3.240      26    52   127   241   467

Second run:

iterations      50   100   250   500  1000
10.0.3.60       20    31    77   138   265
10.0.3.109       9    20    52   116   245
10.0.3.240      21    49   121   246   490

So it looks like it load-balances between pg1 and pg3, and not between
the three IPs - is this expected?

If I switch from "host=pg1,pg3" to "host=pg1,pg1,pg3", each IP address is
hit roughly equally.

So I guess this is how it should work, but in that case I think the
documentation should be more explicit about what is to be expected if a
host has multiple IP addresses or hosts are specified multiple times in
the connection string.
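
A minimal model of the behaviour observed above, assuming (this is
inferred from the numbers, not read from the patch) that one entry of
the host list is picked uniformly first, then one of that entry's
resolved addresses is picked uniformly, and that all hosts are
reachable. The RESOLVED table and the function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical resolver table mirroring the /etc/hosts entries above.
RESOLVED = {
    "pg1": ["10.0.3.109", "10.0.3.60"],
    "pg3": ["10.0.3.240"],
}


def pick_address(hosts, rng):
    """Pick a host-list entry uniformly, then one of its addresses."""
    host = rng.choice(hosts)
    return rng.choice(RESOLVED[host])


def simulate(hosts, iterations, seed=0):
    """Count how often each address is chosen over many connections."""
    rng = random.Random(seed)
    return Counter(pick_address(hosts, rng) for _ in range(iterations))


for hosts in (["pg1", "pg3"], ["pg1", "pg1", "pg3"]):
    counts = simulate(hosts, 30000)
    print(",".join(hosts), dict(sorted(counts.items())))
```

Under this model "host=pg1,pg3" sends about half of the connections to
pg3 and a quarter to each pg1 address, while "host=pg1,pg1,pg3" gives
each address about a third, which matches the pattern in the tables.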

> > Maybe my imagination is not so great, but what else than hosts could we
> > possibly load-balance? I don't mind calling it load_balance, but I also
> > don't feel very strongly one way or the other and this is clearly
> > bikeshed territory.
>
> I agree, which is why I called it load_balance in my original patch.
> But I also think it's useful to match the naming for the already
> existing implementations in the PG ecosystem around this.
> But like you I don't really feel strongly either way. It's a tradeoff
> between short name and consistency in the ecosystem.

I don't think consistency is a very strong argument here. As a
counterpoint, pgJDBC had targetServerType some time before Postgres, and
the libpq parameter was then named somewhat differently when it was
introduced, namely target_session_attrs.

> > If I understand correctly, you've added DNS-based load balancing on top
> > of just shuffling the provided hostnames.  This makes sense if a
> > hostname is backed by more than one IP address in the context of load
> > balancing, but it also complicates the patch. So I'm wondering how much
> > shorter the patch would be if you leave that out for now?
>
> Yes, that's correct and indeed the patch would be simpler without, i.e. all the
> addrinfo changes would become unnecessary. But IMHO the behaviour of
> the added option would be very unexpected if it didn't load balance across
> multiple IPs in a DNS record. libpq currently makes no real distinction in
> handling of provided hosts and handling of their resolved IPs. If load balancing
> would only apply to the host list that would start making a distinction
> between the two.

Fair enough, I agree.

> Apart from that the load balancing across IPs is one of the main reasons
> for my interest in this patch. The reason is that it allows expanding or reducing
> the number of nodes that are being load balanced across transparently to the
> application. Which means that there's no need to re-deploy applications with
> new connection strings when changing the number of hosts.

That's a good point as well.
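
To illustrate why this is transparent to the application: the client
only needs the name, and whatever set of addresses the resolver returns
at connect time is what gets balanced across. The sketch below is a
standalone illustration using Python's socket.getaddrinfo, not libpq
code; the function name and the default port 5432 are assumptions.

```python
import socket


def resolve_all(host, port=5432):
    """Return the distinct addresses the resolver reports for host,
    in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        addr = sockaddr[0]
        if addr not in addresses:
            addresses.append(addr)
    return addresses


# Example (depends on the local resolver, so no output is shown):
#   resolve_all("pg1")
```

Adding or removing a record for the name changes the returned list on
the next lookup, without touching the connection string.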

> > On the other hand, I believe pgJDBC keeps track of which hosts are up or
> > down and only load balances among the ones which are up (maybe
> > rechecking after a timeout? I don't remember), is this something you're
> > doing, or did you consider it?
>
> I don't think it's possible to do this in libpq without huge changes to its
> architecture, since normally a PGconn will only
> create a single connection. The reason pgJDBC can do this is because
> it's actually a connection pooler, so it will open more than one connection
> and can thus keep some global state about the different hosts.

Ok.

Michael
