Re: Synchronization levels in SR

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronization levels in SR
Date: 2010-06-02 09:25:20
Message-ID: 4C062380.8090108@enterprisedb.com
Lists: pgsql-hackers

On 02/06/10 10:22, Greg Smith wrote:
> Heikki Linnakangas wrote:
>> The possibilities are endless... Your proposal above covers a pretty
>> good set of scenarios, but it's by no means complete. If we try to
>> solve everything the configuration will need to be written in a
>> Turing-complete Replication Description Language. We'll have to pick a
>> useful, easy-to-understand subset that covers the common scenarios. To
>> handle the more exotic scenarios, you can write a proxy that sits in
>> front of the master, and implements whatever rules you wish, with the
>> rules written in C.
>
> I was thinking about this a bit recently. As I see it, there are three
> fundamental parts of this:
>
> 1) We have a transaction that is being committed. The rest of the
> computations here are all relative to it.

Agreed.

> So in a 3 node case, the internal state table might look like this after
> a bit of data had been committed:
>
> node | location | state
> ----------------------------------
> a | local | fsync
> b | remote | recv
> c | remote | async
>
> This means that the local node has a fully persistent copy, but the best
> either remote one has done is received the data, it's not on disk at all
> yet at the remote data center. Still working its way through.
>
> 3) The decision about whether the data has been committed to enough
> places to be considered safe by the master is computed by a function
> that is passed this internal table as something like a SRF, and it
> returns a boolean. Once that returns true, saying it's satisfied, the
> transaction closes on the master and continues to percolate out from
> there. If it's false, we wait for another state change to come in and
> return to (2).

You can't implement "wait for X to ack the commit, but if that doesn't
happen in Y seconds, time out and return true anyway" with that.
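
To illustrate the limitation, here is a minimal Python sketch, not
PostgreSQL code. All names in it (wait_for_commit, b_has_fsynced, the
changes queue) are hypothetical, and the three levels are taken from the
quoted table. The decision function sees only the state table and is
re-evaluated only when a state change arrives, so a "give up after Y
seconds" rule has to live in the waiting loop rather than in the function:

```python
import queue
import time
from typing import Callable, Dict, Optional

# Durability levels from the quoted table, ordered weakest to strongest.
LEVELS = {"async": 0, "recv": 1, "fsync": 2}

State = Dict[str, str]  # node name -> level reached so far


def b_has_fsynced(state: State) -> bool:
    """Hypothetical decision function: satisfied once node b has fsynced."""
    return LEVELS[state.get("b", "async")] >= LEVELS["fsync"]


def wait_for_commit(
    decide: Callable[[State], bool],
    state: State,
    changes: "queue.Queue[tuple]",
    timeout: Optional[float] = None,
) -> str:
    """Re-run decide() after each state change until it returns True.

    decide() only sees the table, so it cannot express a timeout itself.
    The deadline is handled here, outside the decision function.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not decide(state):
        remaining = None if deadline is None else deadline - time.monotonic()
        if remaining is not None and remaining <= 0:
            return "timed out"
        try:
            # Block until the next (node, level) report or the deadline.
            node, level = changes.get(timeout=remaining)
        except queue.Empty:
            return "timed out"
        state[node] = level
    return "acknowledged"
```

With timeout=None and no further state changes, the loop blocks forever,
which is the behaviour a boolean function of the table alone gives you.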

> While exposing the local state and running this computation isn't free,
> in situations where there truly are remote nodes in here being
> communicated with the network overhead is going to dwarf that. If there
> were a fast path for the simplest cases and this complicated one for the
> rest, I think you could get the fully programmable behavior some people
> want using simple SQL, rather than having to write a new "Replication
> Description Language" or something so ambitious. This data about what's
> been replicated to where looks an awful lot like a set of rows you can
> operate on using features already in the database to me.

Yeah, if we want to provide full control over when a commit is
acknowledged to the client, there's certainly no reason we can't expose
that using a hook or something.

It's pretty scary to call a user-defined function at that point in the
transaction. Even if we document that you must refrain from doing nasty
stuff like modifying tables in that function, it's still scary.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
