Re: Proposal: Commit timestamp

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 01:13:10
Message-ID: 45CA7926.6070905@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/7/2007 12:54 PM, Markus Schiltknecht wrote:
> Hi,
>
> Jan Wieck wrote:
>> Are we still discussing if the Postgres backend may provide support for
>> a commit timestamp, that follows the rules for Lamport timestamps in a
>> multi-node cluster?
>
> No. And I think you know my opinion about that by now. ;-)

Then let me give you a little puzzle just for the fun of it.

A database containing customer contact information (among other things)
is a two node multimaster system. One is serving the customer web
portal, the other is used by the company staff including the call
center. At 13:45 the two servers lose connectivity to each other, yet
the internal staff can access the internal server while the web portal
is accessible from the outside. At 13:50 customer A updates their credit
card information through the web portal, while customer B does the same
through the call center. At 13:55 both customers change their mind to
use yet another credit card, now customer A phones the call center while
customer B does it via the internet.

At 14:00 the two servers reconnect and go through the conflict
resolution. How do you intend to solve both conflicts without using any
"clock", because that seems to be a stopword causing instant rejection
of whatever you propose. Needless to say, both customers will be
dissatisfied if you charge the "wrong" credit card during your next
billing cycle.

>
>> It seems more like we are drifting into what type of
>> replication system I should design to please most people.
>
> Nobody is telling you what you should do. You're free to do whatever you
> want to.
>
> I'm only trying to get a discussion going, because a) I'm interested in
> how you plan to solve these problems and b) in the past, most people
> were complaining that all the different replication efforts didn't try
> to work together. I'm slowly trying to open up and discuss what I'm
> doing with Postgres-R on the lists.

Which is a good discussion because one of the reasons why I stopped
looking into Postgres-R is the fact that is based on the idea to push
all the replication information through a system that generates a global
serialized message queue. That by itself isn't the problem, but the fact
that implementing a global serialized message queue has serious
throughput issues that are (among other details) linked to the speed of
light.

I am trying to start with a system, that doesn't rely on such a
mechanism for everything. I do intend to add an option later, that
allows to declare a UNIQUE NOT NULL constraint to be synchronous. What
that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE
will require the node to currently be a member of the (quorum or
priority defined) majority of the cluster. An advisory lock system,
based on a total order group communication, will grant the lock to the
unique key values on a first come, first serve base. Every node in the
cluster will keep those keys as "locked" until the asynchronous
replication stream reports the locking transaction as ended. If another
remote transaction in the meantime requires updating such key, the
incoming stream from that node will be on hold until the lock is
cleared. This is to protect agains node B replicating a transaction from
node A and a later update on node B arrives on C before C got the first
event from A. A node that got disconnected from the cluster must rebuild
the current advisory lock list upon reconnecting to the cluster.

I think that this will be a way to overcome Postgres-R's communication
bottleneck, as well as allowing limited update activity even during a
completely disconnected state of a node. Synchronous or group
communication messages are reduced to the cases, where the application
cannot be implemented in a conflict free way, like allocating a natural
primary key. There is absolutely no need to synchronize for example
creating a sales order. An application can use global unique ID's for
the order number. And everything possibly referenced by an order (items,
customers, ...) is stored in a way that the references are never
updated. Deletes to those possibly referenced objects are implemented in
a two step process, where they are first marked obsolete, and later on
things that have been marked obsolete for X long are deleted. A REPLICA
TRIGGER on inserting an order will simply reset the obsolete flag of
referenced objects. If a node is disconnected longer than X, you have a
problem - hunt down the guy who defined X.

> Just yesterday at the SFPUG meeting, I've experienced how confusing it
> is for the users to have such a broad variety of (existing and upcoming)
> replication solutions. And I'm all for working together and probably
> even for merging different replication solutions.

Merging certain ideas to come up with an async/sync hybrid? Seems to me
we have similar enough ideas to need conflict resolution, because we had
them simultaneously but communicate them asynchronously.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Markus Schiltknecht 2007-02-08 02:27:52 Re: Proposal: Commit timestamp
Previous Message Jeremy Drake 2007-02-08 00:20:25 Re: WIP patch adding new regexp functions