Re: Geographic High-Availability/Replication

From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Bill Moran <wmoran(at)potentialtech(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Geographic High-Availability/Replication
Date: 2007-08-27 23:13:05
Message-ID: 46D35A81.3030909@bluegap.ch
Lists: pgsql-general

Hello Bill,

Bill Moran wrote:
> It appears as if I miscommunicated my point. I'm not expecting
> PostgreSQL-R to break the laws of physics or anything, I'm just
> curious how it reacts. This is the difference between software
> that will be really great one day, and software that is great now.

Agreed. As Postgres-R is still a prototype, it *currently* does not
handle this situation at all. But I'm thankful for this discussion, as
it helps me figure out how Postgres-R *should* react. So, thank you
for pointing this out.

> Great now would mean the system would notice that it's too far behind
> and Do The Right Thing automatically. I'm not exactly sure what The
> Right Thing is, but my first guess would be force the hopelessly
> slow node out of the cluster. I expect this would be non-trivial,
> as you'd have to have a way to ensure it was a problem isolated to
> a single (or few) nodes, and not just the whole cluster getting hit
> with unexpected traffic.

Hm.. yeah, that's a tricky decision to make. For a start, I'd be in
favor of just informing the administrator about the delay and letting
him take care of the problem (as is currently done with 'disk full'
conditions), instead of trying to do something clever automatically.
(This seems to be much more PostgreSQL-like, too.)
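
To make the "inform, don't evict" idea a bit more concrete, here is a
minimal sketch in Python. It is not Postgres-R code: the function names,
the per-node lag measured in seconds, and the 30-second threshold are all
hypothetical, chosen only for illustration. It only produces messages for
the administrator, and it uses the median lag as a crude way to tell a
single slow node apart from a cluster that is busy as a whole.

```python
from statistics import median


def classify_lag(lag_by_node, warn_after_s=30.0):
    """Return (cluster_wide, slow_nodes) for a mapping node name -> lag in seconds.

    Hypothetical helper: the threshold and the median heuristic are
    illustrative assumptions, not Postgres-R behaviour.
    """
    if not lag_by_node:
        return False, []
    # If even the median node is behind, the whole cluster is under load,
    # so singling out individual nodes would be misleading.
    cluster_wide = median(lag_by_node.values()) > warn_after_s
    slow_nodes = sorted(n for n, lag in lag_by_node.items() if lag > warn_after_s)
    return cluster_wide, slow_nodes


def admin_messages(lag_by_node, warn_after_s=30.0):
    """Build warnings for the administrator; no node is ever forced out."""
    cluster_wide, slow_nodes = classify_lag(lag_by_node, warn_after_s)
    if cluster_wide:
        return [
            f"WARNING: cluster-wide replication delay "
            f"(median lag above {warn_after_s:g}s)"
        ]
    return [f"WARNING: node {n} is {lag_by_node[n]:g}s behind" for n in slow_nodes]


if __name__ == "__main__":
    for line in admin_messages({"node_a": 1, "node_b": 2, "node_c": 120}):
        print(line)
```

What the administrator then does with such a warning (tune the
configuration, add hardware, or take the node out by hand) stays a
manual decision.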

> Of course not, that's why the behaviour when that non-ideal situation
> occurs is so interesting. How does PostgreSQL-R fail? PostgreSQL
> fails wonderfully: A hardware crash will usually result in a system
> that can recover without operator intervention. In a system like
> PostgreSQL-R, the failure scenarios are more numerous, and probably
> more complicated.

I agree that there are more failure scenarios, although fewer of them
are critical to the complete system.

IMO, a node which is too slow should not be considered a failure, but
rather a system limitation (possibly due to unfortunate configuration),
much like out of memory or disk space conditions. Forcing such a node to
go down could have unwanted side effects on the other nodes (e.g.
increased read-only traffic) *and* does not solve the real problem.

Again, thanks for pointing this out. I'll think more about these issues,
especially corner cases similar to this one. Single-node disk full
would be another example. Possibly also out of memory conditions?

Regards

Markus
