
Re: Proposal: Commit timestamp

From:	Theo Schlossnagle <jesus(at)omniti(dot)com>
To:	Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc:	Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject:	Re: Proposal: Commit timestamp
Date:	2007-02-04 15:53:32
Message-ID:	051DCCC6-7D83-4AA1-B6F6-4035836E56A4@omniti.com
Lists:	pgsql-hackers

On Feb 4, 2007, at 10:06 AM, Jan Wieck wrote:

> On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
>> Jan Wieck wrote:
>>> This is all that is needed for last update wins resolution. And as
>>> said before, the only reason the clock is involved in this is so
>>> that nodes can continue autonomously when they lose connection
>>> without conflict resolution going crazy later on, which it would
>>> do if they were simple counters. It doesn't require microsecond
>>> synchronized clocks and the system clock isn't just used as a
>>> Lamport timestamp.
>> Earlier you said that "one assumption is that all servers in the
>> multimaster cluster are ntp synchronized", which already rung the
>> alarm bells in me. Now that I read this you appear to require
>> synchronization not on the microsecond level but on some level. I
>> think that would be pretty hard to manage for an administrator,
>> seeing that NTP typically cannot provide such guarantees.
>
> Synchronization to some degree is wanted to avoid totally
> unexpected behavior. The conflict resolution algorithm itself can
> perfectly fine live with counters, but I guess you wouldn't want
> the result of it. If you update a record on one node, then 10
> minutes later you update the same record on another node.
> Unfortunately, the nodes had no communication and because the first
> node is much busier, its counter is way advanced ... this would
> mean the 10 minutes later update would get lost in the conflict
> resolution when the nodes reestablish communication. They would
> have the same data at the end, just not what any sane person would
> expect.
>
> This behavior will kick in whenever the cross node conflicting
> updates happen close enough so that the time difference between the
> clocks can affect it. So if you update the logical same row on two
> nodes within a tenth of a second, and the clocks are more than that
> apart, the conflict resolution can result in the older row to
> survive. Clock synchronization is simply used to minimize this.
>
> The system clock is used only to keep the counters somewhat
> synchronized in the case of connection loss to retain some degree
> of "last update" meaning. Without that, continuing autonomously
> during a network outage is just not practical.

A Lamport clock addresses this. It relies on a cluster-wide clock
tick. While it could be based on the system clock, it would not be
based on more than one clock. The point of the Lamport clock is that
there is _a_ clock, not multiple ones.
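
For reference, a minimal sketch of the textbook Lamport logical clock
rule (not something this thread specifies): each node keeps a plain
counter, advances it on every local event, and on receiving a message
moves it past the sender's stamp. The class and method names are
illustrative assumptions. It shows how a mostly idle node is pulled
forward by a busy one as soon as they communicate, without any
reference to wall-clock time.

```python
class LamportClock:
    """Textbook Lamport logical clock: a counter, no wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event (for example a commit): advance the counter.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On receiving a message, move past the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time


busy, idle = LamportClock(), LamportClock()
for _ in range(1000):
    busy.tick()                         # busy node does 1000 local events

stamp_sent = busy.tick()                # stamp carried on a message
idle_after = idle.receive(stamp_sent)   # idle node catches up on receipt
```

The catch-up only happens when messages flow, which is why a
disconnected node's counter can drift arbitrarily far from the others.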

One concept is to have a universal clock that ticks forward (say,
every second) and each node orders all of its transactions inside the
second-granular tick. Then each commit would be like: {node,
clocksecond, txn#} and each time the clock ticks forward, txn# is
reset to zero. This gives you ordered txns that are windowed in some
cluster-wide acceptable window (1 second). However, this is totally
broken if the tick is taken from each node's system clock, as NTP is
entirely insufficient for this purpose because of a variety of forms
of clock skew. As such, the timestamp should be incremented via
cluster consensus (a token ring or the pulse generated by the leader
of the current cluster membership quorum).
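
The {node, clocksecond, txn#} scheme above could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
names CommitStamp, Node, on_cluster_tick and last_update_wins are
hypothetical, the tick is assumed to be delivered by whatever
consensus mechanism the cluster uses (it is not read from a local
clock), and comparing stamps as (tick, txn#, node) with the node name
as a tie-breaker is an assumption the message does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitStamp:
    node: str   # originating node
    tick: int   # cluster-wide tick, advanced by consensus
    txn: int    # per-node transaction number within the tick

    def sort_key(self):
        # Assumed ordering: tick first, then txn#, node as tie-breaker.
        return (self.tick, self.txn, self.node)


class Node:
    def __init__(self, name):
        self.name = name
        self.tick = 0
        self.txn = 0

    def on_cluster_tick(self, tick):
        # Hypothetical callback fired by the token ring or quorum leader.
        if tick > self.tick:
            self.tick = tick
            self.txn = 0        # txn# resets whenever the tick advances

    def commit(self):
        stamp = CommitStamp(self.name, self.tick, self.txn)
        self.txn += 1
        return stamp


def last_update_wins(a, b):
    # Keep whichever commit has the larger stamp.
    return a if a.sort_key() > b.sort_key() else b


a, b = Node("a"), Node("b")
for n in (a, b):
    n.on_cluster_tick(1)
early = [a.commit() for _ in range(500)][-1]   # busy node, still in tick 1
for n in (a, b):
    n.on_cluster_tick(2)
late = b.commit()                              # quiet node, now in tick 2
winner = last_update_wins(early, late)
```

Here the quiet node's later commit wins even though the busy node has
a much higher txn#, because the tick is compared first. Two commits in
the same tick on different nodes are still ordered only by txn#, so the
tick length bounds how "wrong" the resolution can be.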

As the clock must be incremented cluster-wide, the need for it to be
in sync with the system clock (on any or all of the systems) is
obviated. In fact, because you can't guarantee that synchronicity, a
time-based clock can be confusing: one expects a time-based clock to
be accurate to the time. A counter-based clock carries no such
expectation.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/

In response to

Re: Proposal: Commit timestamp at 2007-02-04 15:06:27 from Jan Wieck

Responses

Re: Proposal: Commit timestamp at 2007-02-04 16:34:00 from Gregory Stark
Re: Proposal: Commit timestamp at 2007-02-04 18:36:03 from Jan Wieck
Re: Proposal: Commit timestamp at 2007-02-05 11:20:25 from Zeugswetter Andreas ADI SD
