Re: Geographic High-Availability/Replication

From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Geographic High-Availability/Replication
Date: 2007-08-29 13:46:07
Message-ID: 46D5789F.2090103@bluegap.ch
Lists: pgsql-general

Hi,

Decibel! wrote:
> But is the complete transaction information safely stored on all nodes
> before a commit returns?

Good question. It depends very much on the group communication system
and the guarantees it provides for message delivery. What's certain is
that the information is not safely stored on every node before the
commit is confirmed. Let me quickly explain those two points.

Lately, I've read a lot about different Group Communication Systems and
how they handle delivery guarantees. Spread offers an 'agreed' and a
'safe' mode; only the latter guarantees that all nodes have received the
data, and it's a rather expensive mode in terms of latency.

In our case, it would be sufficient if at least n nodes confirmed
having correctly received the data. That would tolerate (n - 1)
simultaneously failing nodes, so there's always at least one correct
node which has received the data, even if the sender failed right after
sending. That one node can then redistribute the data to the nodes
which missed the message, until all nodes have received it (see the
sketch below).
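
Just to illustrate what I mean by waiting for n confirmations, here's a
toy sketch in Python with made-up names; it's not code from any existing
group communication system:

    import threading

    class DeliveryTracker:
        # Confirm a message once at least n distinct nodes acknowledged it.
        def __init__(self, required_acks):
            self.required_acks = required_acks   # the "n" from above
            self.acked_nodes = set()
            self.cond = threading.Condition()

        def record_ack(self, node_id):
            # Called whenever the GCS tells us some node received the message.
            with self.cond:
                self.acked_nodes.add(node_id)
                if len(self.acked_nodes) >= self.required_acks:
                    self.cond.notify_all()

        def wait_until_safe(self, timeout=None):
            # Block the committing node until n nodes have the data,
            # i.e. until we can survive (n - 1) simultaneous failures.
            with self.cond:
                return self.cond.wait_for(
                    lambda: len(self.acked_nodes) >= self.required_acks,
                    timeout=timeout)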

No group communication system I know of offers such fine-grained
delivery guarantees. Additionally, I've realized that it would be nice
to have subgroups and multiple orderings within a group. Thus, contrary
to my initial intention, I've finally started to write yet another
group communication system providing all of these nice features.
Anyway, that's another story.

Regarding durability: given the above assumption that at most (n - 1)
nodes fail, you don't have to care much about recovery, because there's
always at least one running node which has all the data. As we know,
reality doesn't always care about our assumptions. So, if you want to
prevent data loss due to failures of more than (n - 1) nodes, possibly
even all nodes, you'd have to do transaction logging, much like WAL,
but cluster-wide. Having every single node write its own transaction
log, like WAL, would be rather expensive and would make recovery
complex, as you'd have to merge all nodes' WALs.

Instead, I think it's better to decouple transaction logging (backup)
from ordinary operation. That gives you much more freedom. For example,
you could have nodes dedicated to and optimized for logging. But most
importantly, you have separated the problem: as long as your permanent
storage for transaction logging survives, you can recover your data, no
matter what happens to the rest of the cluster. And the other way
around: as long as your cluster is alive (i.e. no more than (n - 1)
simultaneous failures), you don't really need the transaction log.
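
To make the dedicated-logger idea a bit more concrete, a rough sketch
(again Python with invented names, just to show the shape of it): the
logger consumes the transaction stream in the global delivery order and
appends it to one cluster-wide log, fsyncing before it acknowledges, so
recovery needs only this single log instead of every node's WAL.

    import json, os

    class ClusterTxnLogger:
        def __init__(self, path):
            # Append-only, cluster-wide transaction log on permanent storage.
            self.f = open(path, "ab")

        def log_transaction(self, global_seq, txn_payload):
            # One record per transaction, in global delivery order.
            record = json.dumps({"seq": global_seq, "txn": txn_payload}) + "\n"
            self.f.write(record.encode())
            self.f.flush()
            os.fsync(self.f.fileno())   # durable before we acknowledge
            return global_seq           # ack back to the committing node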

So, before committing a transaction, a node has to wait for the delivery
of the data through the GCS *and* for the transaction logger(s) to have
written the data to permanent storage. Note that those two operations
can run in parallel, so their latencies don't add up; the overall cost
is just the maximum of the two.
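
In sketch form, reusing the two toy classes from above (still just
illustrative Python, not an actual implementation): start both waits in
parallel and confirm the commit only when both have completed, so the
cost is roughly max(GCS delivery, logger fsync) rather than the sum.

    from concurrent.futures import ThreadPoolExecutor

    def commit(txn, tracker, logger, global_seq):
        # tracker: DeliveryTracker from above, logger: ClusterTxnLogger.
        with ThreadPoolExecutor(max_workers=2) as pool:
            delivered = pool.submit(tracker.wait_until_safe)
            logged = pool.submit(logger.log_transaction, global_seq, txn)
            delivered.result()   # n nodes confirmed delivery
            logged.result()      # data is on permanent storage
        return "COMMIT"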

Regards

Markus
