Re: Best high availability solution ?

From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Best high availability solution ?
Date: 2006-05-31 12:42:12
Message-ID: 87verm5psr.fsf@wolfe.cbbrowne.com
Lists: pgsql-general

In the last exciting episode, dpage(at)vale-housing(dot)co(dot)uk ("Dave Page") wrote:
> If I'm honest, I think your boss is going to be disappointed. You
> would add a *lot* of complexity to the system to make it handle
> failures with zero intervention, and that extra complexity is
> probably more likely to go wrong than a single server. I'd spend
> your time and money on making sure your raid & ups are good, that
> you are running on server grade hardware with ECC RAM, and that you
> have good out of band management facilities so even if you are away
> from the office you can connect via VPN/modem or whatever and fix
> things.

We have found much the same thing in trying to get improved
reliability out of HACMP (an IBM product that automatically fails over
applications between servers).

We had previously experienced too-frequent problems due to lack of
reliability of our servers. (Sun high end stuff, as it happened...)

Since moving to HACMP on AIX, well, the IBM AIX servers themselves have
been way more reliable. Unfortunately, HACMP is all too fragile. It
has a lot of "moving parts" (instances of the "extra complexity" that
Dave mentioned), and apparently keeping it reliable means upgrading
components so often that the outages needed for those upgrades rather
undermine the uptime.

My suspicion (not actually confirmable with real numbers; all I can do
is hand-wave) is that if we had instead spent the money that went into
HACMP on beefing up the Golden Servers, we'd probably have had better
reliability simply from depending on the individual boxes.

In any case, whatever you use for this, whether Slony-I, with
"automatic failover" scripts, or some sort of "heartbeating/server
takeover" scheme, will suffer from the "too many complex components"
problem.
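
To put some (entirely made-up) numbers on that hand-waving, here is a
rough Python sketch. It assumes failures are independent, that the
cluster is up only when at least one of two servers is up AND every
piece of failover machinery is working, and that each availability
figure is a hypothetical fraction of time the component is up. None of
these figures are measurements from HACMP, Slony-I, or anything else.

```python
def cluster_availability(server, machinery):
    """Availability of a two-server failover pair (illustrative model).

    server    -- assumed availability of one server (0.0 to 1.0)
    machinery -- assumed availabilities of the failover components
                 (heartbeat, takeover scripts, and so on), ALL of
                 which must work for the cluster to be usable
    Assumes every failure is independent of every other.
    """
    # The pair is down only if both servers are down at the same time.
    pair = 1.0 - (1.0 - server) ** 2
    # Each extra moving part multiplies in its own chance of failing.
    for part in machinery:
        pair *= part
    return pair


# Hypothetical figures, for illustration only.
single = 0.999
fragile = cluster_availability(single, [0.999, 0.999, 0.999])
solid = cluster_availability(single, [0.9999, 0.9999, 0.9999])

print("single server:          ", single)
print("pair, fragile machinery:", fragile)
print("pair, solid machinery:  ", solid)
```

Under those assumptions, the redundant pair only beats the single box
if the failover components are each noticeably more reliable than the
server they are protecting. With three components that are merely as
good as the server, the pair comes out worse.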

A vital problem is that it's really hard to validate that the
production configuration is correct. If you made a mistake, it'll all
blow up. And you don't want to run tests that might blow everything
up, do you? :-)
--
let name="cbbrowne" and tld="gmail.com" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/linuxdistributions.html
"Now, if someone proposed using people who spam comp.sys.* groups with
political screeds in place of lab rats for drug testing, I'd
wholeheartedly concur". -- John C. Randolph
