Quick Links

Re: Proposal: Commit timestamp

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Theo Schlossnagle <jesus(at)omniti(dot)com>
Cc:	Jan Wieck <JanWieck(at)Yahoo(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject:	Re: Proposal: Commit timestamp
Date:	2007-02-06 07:36:31
Message-ID:	45C82FFF.1010605@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Theo Schlossnagle wrote:
> On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
>> Obviously the counters will immediately drift apart based on the
>> transaction load of the nodes as soon as the network goes down. And in
>> order to avoid this "clock" confusion and wrong expectation, you'd
>> rather have a system with such a simple, non-clock based counter and
>> accept that it starts behaving totally wonky when the cluster
>> reconnects after a network outage? I rather confuse a few people than
>> having a last update wins conflict resolution that basically rolls
>> dice to determine "last".
>
> If your cluster partition and you have hours of independent action and
> upon merge you apply a conflict resolution algorithm that has enormous
> effect undoing portions of the last several hours of work on the nodes,
> you wouldn't call that "wonky?"

You are talking about different things. Async replication, as Jan is
planning to do, is per se "wonky", because you have to cope with
conflicts by definition. And you have to resolve them by late-aborting a
transaction (i.e. after a commit). Or put it another way: async MM
replication means continuing in disconnected mode (w/o quorum or some
such) and trying to reconciliate later on. It should not matter if the
delay is just some milliseconds of network latency or three days (except
of course that you probably have more data to reconciliate).

> For sane disconnected (or more generally, partitioned) operation in
> multi-master environments, a quorum for the dataset must be
> established. Now, one can consider the "database" to be the dataset.
> So, on network partitions those in "the" quorum are allowed to progress
> with data modification and others only read.

You can do this to *prevent* conflicts, but that clearly belongs to the
world of sync replication. I'm doing this in Postgres-R: in case of
network partitioning, only a primary partition may continue to process
writing transactions. For async replication, it does not make sense to
prevent conflicts when disconnected. Async is meant to cope with
conflicts. So as to be independent of network latency.

> However, there is no
> reason why the dataset _must_ be the database and that multiple datasets
> _must_ share the same quorum algorithm. You could easily classify
> certain tables or schema or partitions into a specific dataset and apply
> a suitable quorum algorithm to that and a different quorum algorithm to
> other disjoint data sets.

I call that partitioning (among nodes). And it's applicable to sync as
well as async replication, while it makes more sense in sync replication.

What I'm more concerned about, with Jan's proposal, is the assumption
that you always want to resolve conflicts by time (except for balances,
for which we don't have much information, yet). I'd rather say that time
does not matter much if your nodes are disconnected. And (especially in
async replication) you should prevent your clients from committing to
one node and then reading from another, expecting to find your data
there. So why resolve by time? It only makes the user think you could
guarantee that order, but you certainly cannot.

Regards

Markus

In response to

Re: Proposal: Commit timestamp at 2007-02-04 19:04:27 from Theo Schlossnagle

Responses

Re: Proposal: Commit timestamp at 2007-02-06 09:25:29 from Zeugswetter Andreas ADI SD

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jonathan Gray	2007-02-06 07:41:18	Pl/pgsql functions causing crashes in 8.2.2
Previous Message	Josh Berkus	2007-02-06 05:21:05	Re: period data type