Re: Synchronization levels in SR

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronization levels in SR
Date: 2010-06-02 07:22:42
Message-ID: 4C0606C2.4090901@2ndquadrant.com

Heikki Linnakangas wrote:
> The possibilities are endless... Your proposal above covers a pretty
> good set of scenarios, but it's by no means complete. If we try to
> solve everything the configuration will need to be written in a
> Turing-complete Replication Description Language. We'll have to pick a
> useful, easy-to-understand subset that covers the common scenarios. To
> handle the more exotic scenarios, you can write a proxy that sits in
> front of the master, and implements whatever rules you wish, with the
> rules written in C.

I was thinking about this a bit recently. As I see it, there are three
fundamental parts of this:

1) We have a transaction that is being committed. The rest of the
computations here are all relative to it.

2) There is an (internal?) table that lists the state of each
replication target relative to that transaction. It would include the
node name, perhaps some metadata ('location' seems the one that's most
likely to help with the remote data center issue), and a state code.
The codes from http://wiki.postgresql.org/wiki/Streaming_Replication
work fine for the last part (which is the only dynamic one--everything
else is static data being joined against):

async=hasn't received yet
recv=been received but just in RAM
fsync=received and synced to disk
apply=applied to the database

These would need to be enums so they can be ordered from lesser to
greater consistency.
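
As a sketch only (replication_state is a made-up name here; no such
user-visible type exists today), that ordering is exactly what an enum
type already gives you:

CREATE TYPE replication_state AS ENUM ('async', 'recv', 'fsync', 'apply');
-- comparisons follow declaration order, so state >= 'fsync' means
-- "at least as durable as an fsync'd copy"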

So in a 3-node case, the internal state table might look like this after
a bit of data has been committed:

 node | location | state
------+----------+--------
 a    | local    | fsync
 b    | remote   | recv
 c    | remote   | async

This means that the local node has a fully persistent copy, but the best
either remote node has done is receive the data; nothing is on disk yet
at the remote data center. The data is still working its way through.

3) The decision about whether the data has been committed to enough
places to be considered safe by the master is computed by a function
that is passed this internal table as something like a SRF, and that
returns a boolean. Once it returns true, saying it's satisfied, the
transaction closes on the master and continues to percolate out from
there. If it returns false, we wait for another state change to come in
and return to (2).
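
As a rough sketch of that hook's shape (commit_is_safe and
replication_status are hypothetical names; the latter stands in for
however the internal table gets exposed), the simplest policy might look
like:

CREATE FUNCTION commit_is_safe() RETURNS boolean AS $$
    SELECT EXISTS (
        SELECT 1 FROM replication_status() WHERE state >= 'recv'
    );
$$ LANGUAGE sql;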

I would propose that most of the behaviors people have expressed as
their desired implementation are possible to implement using this
scheme:

-Semi-sync commit: proceed as soon as somebody else has a copy and hope
the copies all become consistent: EXISTS WHERE state>=recv
-Don't proceed until there's an fsync'd commit on at least one of the
remote nodes: EXISTS WHERE location='remote' AND state>=fsync
-Look for a quorum of at least n commits of fsync quality: CASE WHEN
(SELECT COUNT(*) WHERE state>=fsync)>=n THEN true ELSE false END

Syntax is obviously rough but I think you can get the drift of what I'm
suggesting.
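
Spelled out against the hypothetical replication_status() SRF, the
quorum case might read something like this (with n = 2 picked
arbitrarily):

SELECT count(*) >= 2 AS satisfied
  FROM replication_status()
 WHERE state >= 'fsync';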

While exposing the local state and running this computation isn't free,
in situations where there truly are remote nodes being communicated
with, the network overhead is going to dwarf that cost. If there were a
fast path for the simplest cases and this more complicated path for the
rest, I think you could get the fully programmable behavior some people
want using simple SQL, rather than having to write a new "Replication
Description Language" or anything so ambitious. To me, this data about
what's been replicated to where looks an awful lot like a set of rows
you can operate on using features already in the database.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us
