Re: Determine state of cluster (HA)

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Determine state of cluster (HA)
Date: 2017-10-16 12:10:25
Message-ID: 20171016141025.6bc022b8@firost
Lists: pgsql-hackers

On Mon, 16 Oct 2017 10:39:16 +0800
Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> On 13 October 2017 at 08:50, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:

> > I had a long call with a firm developing front end proxy/cache/HA for
> > Postgres today. Essentially the software is a replacement for PGPool in
> > entirety but also supports analytics etc... When I was asking them about
> > pain points they talked about the below and I was wondering if this is a
> > problem we would like to solve.
>
> IMO: no one node knows the full state of the system, or can know it.

+1

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
>
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

And some of the items on this list cannot be in core. However, there are some
things PostgreSQL can do to make HA easier to deal with.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
>
> - Get info about master. We should finish merging recovery.conf into
> postgresql.conf.

agree. +1.

To make things easier for the "cluster manager" piece outside of PostgreSQL, I
would add:

* being able to "demote" a master to a standby without a restart.
* being able to check the status of each node without eating a backend
connection, to avoid hitting the "max_connections" limit (see the sketch
below).
* being able to monitor each step of a switchover (or "controlled
failover": standby/master role swapping between two nodes).
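
As an illustration of the status-check point: today a cluster manager has to
burn a regular backend connection on every node it watches just to run
something like this (a minimal sketch, assuming the PostgreSQL 10 function
names):

    -- am I a primary or a standby?
    SELECT pg_is_in_recovery();
    -- replication progress, when run on a standby
    SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();

A lighter channel (think pg_isready, but reporting the recovery state and
LSNs) would let the manager poll aggressively without competing with the
applications for max_connections slots.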

> > b. Attempt to connect to the host directly, if not...
> > c. use the slave and use the hostname via dblink to connect to the master,
> > as the hostname , i.e. select * from dblink('" + connInfo + "
> > dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
> > This is necessary in the event the hostname used in the recovery.conf file
> > is not resolvable from the outside.
>
> OK, so "connect directly" here means from some 3rd party, the one
> doing the querying of the replica.

It seems to me the failover process is issuing all the required commands to
move the master role to another available standby. The knowledge of the
orchestration and of the final status (if everything went well) lives in this
piece of software. If you want to know where your master is in an exotic or
complex setup, ask whoever was responsible for promoting it.

HA should stay as simple as possible. The more complex the architecture, the
more failure scenarios you will have.

> > 1. The dblink call doesn't have a way to specify a timeout, so we have to
> > use Java futures to control how long this may take to a reasonable amount of
> > time;
>
> statement_timeout doesn't work?
>
> If not, that sounds like a sensible, separate feature to add. Patches welcome!
>
> > 2. NAT mapping may result in us detecting IP ranges that are not accessible
> > to the application nodes.
>
> PostgreSQL can't do anything about this one.

You could get the master IP address from the "pg_stat_wal_receiver" view, but
this is still not enough. You might have separate dedicated networks for the
applications and for the pgsql replication. If you want a standby to tell the
application where to connect to the master, you'll have to put this
information somewhere accessible from the application nodes yourself.
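
For what it's worth, here is the kind of query I have in mind, run on the
standby (note that conninfo may only be visible to privileged roles):

    -- "conninfo" points at the replication network, which is not
    -- necessarily reachable from the application nodes
    SELECT status, conninfo
      FROM pg_stat_wal_receiver;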

> > 3. there is no easy way to monitor for state changes as they happen,
> > allowing faster failovers, everything has to be polled based on events;

In the corosync world (the clustering piece of the Pacemaker ecosystem), node
failures are detected really fast, in about 1s.

As for application failure (pgsql here), this will be polling, yes. But I fail
to imagine how a dying application could warn the cluster before dying: not
only when crashing (systemd could help there), but eg. before entering an
infinite dummy loop or an exhausting one.

> It'd be pretty simple to write a function that sleeps in the backend
> until it's promoted. I don't know off the top of my head if we set all
> proc latches when we promote, but if we don't it's probably harmless
> and somewhat useful to do so.

As soon as the cluster manager has promoted a new master, it can trigger an
event to notify whatever you need.
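
For example, nothing forbids the manager's promote hook from pushing such an
event through the freshly promoted master itself, so already-connected clients
hear about it (the channel name below is purely hypothetical):

    -- run by the cluster manager right after the promotion succeeded
    NOTIFY cluster_topology, 'new master promoted';

    -- clients interested in topology changes simply run:
    LISTEN cluster_topology;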

> Either way, you'd do long-polling. Call the function and let the
> connection block until something interesting happens. Use TCP
> keepalives to make sure you notice if it dies. Have the function
> return when the state changes.

This would still rely on the TCP keepalive frequency, so we are back to polling :(
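
At best you can tighten the keepalive knobs of that long-polling connection
with the usual libpq parameters, eg. (values picked arbitrarily for the
example):

    host=standby1 dbname=postgres keepalives=1 keepalives_idle=5
        keepalives_interval=2 keepalives_count=3

But the detection latency is still bounded by how aggressive you dare to make
them.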

Regards,
