Re: Synchronization levels in SR

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronization levels in SR
Date: 2010-06-02 09:36:59
Message-ID: 1275471419.21465.2154.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-06-02 at 03:22 -0400, Greg Smith wrote:
> Heikki Linnakangas wrote:
> > The possibilities are endless... Your proposal above covers a pretty
> > good set of scenarios, but it's by no means complete. If we try to
> > solve everything the configuration will need to be written in a
> > Turing-complete Replication Description Language. We'll have to pick a
> > useful, easy-to-understand subset that covers the common scenarios. To
> > handle the more exotic scenarios, you can write a proxy that sits in
> > front of the master, and implements whatever rules you wish, with the
> > rules written in C.
>
> I was thinking about this a bit recently. As I see it, there are three
> fundamental parts of this:
>
> 1) We have a transaction that is being committed. The rest of the
> computations here are all relative to it.
>
> 2) There is an (internal?) table that lists the state of each
> replication target relative to that transaction. It would include the
> node name, perhaps some metadata ('location' seems the one that's most
> likely to help with the remote data center issue), and a state code.
> The codes from http://wiki.postgresql.org/wiki/Streaming_Replication
> work fine for the last part (which is the only dynamic one--everything
> else is static data being joined against):
>
> async=hasn't received yet
> recv=been received but just in RAM
> fsync=received and synced to disk
> apply=applied to the database
>
> These would need to be enums so they can be ordered from lesser to
> greater consistency.
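
For concreteness, here is a minimal SQL sketch of that state table. All
names are purely illustrative, not existing catalog objects:

  -- Enum declaration order gives the lesser-to-greater consistency
  -- ordering, so comparisons like state >= 'fsync' work as intended.
  CREATE TYPE standby_state AS ENUM ('async', 'recv', 'fsync', 'apply');

  CREATE TABLE standby_progress (
      node     name,          -- standby identifier
      location text,          -- static metadata, e.g. 'local' or 'remote'
      state    standby_state  -- the only dynamic column
  );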
>
> So in a 3 node case, the internal state table might look like this after
> a bit of data had been committed:
>
>  node | location | state
> ------+----------+-------
>  a    | local    | fsync
>  b    | remote   | recv
>  c    | remote   | async
>
> This means that the local node has a fully persistent copy, but the most
> either remote node has done is receive the data; it's not on disk at all
> yet at the remote data center. Still working its way through.
>
> 3) The decision about whether the data has been committed to enough
> places to be considered safe by the master is computed by a function
> that is passed this internal table as something like a SRF, and it
> returns a boolean. Once that returns true, saying it's satisfied, the
> transaction closes on the master and continues to percolate out from
> there. If it's false, we wait for another state change to come in and
> return to (2).
>
> I would propose that most of the behaviors people have expressed as their
> desired implementations are possible to implement using this scheme.
>
> -Semi-sync commit: proceed as soon as somebody else has a copy and hope
> the copies all become consistent: EXISTS WHERE state>=recv
> -Don't proceed until there's an fsync'd commit on at least one of the
> remote nodes: EXISTS WHERE location='remote' AND state>=fsync
> -Look for a quorum of n commits of fsync quality: CASE WHEN (SELECT
> COUNT(*) WHERE state>=fsync)>n THEN true ELSE false END;
>
> Syntax is obviously rough but I think you can get the drift of what I'm
> suggesting.
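
To make one of those rules concrete, here is a sketch against the
illustrative standby_progress table above; this is not proposed syntax,
just the idea expressed in plain SQL:

  -- Hypothetical quorum rule: fire once more than n standbys
  -- have reached fsync or better.
  CREATE FUNCTION quorum_fsync_rule(n integer) RETURNS boolean AS $$
      SELECT (SELECT count(*) FROM standby_progress
              WHERE state >= 'fsync') > n;
  $$ LANGUAGE sql;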
>
> While exposing the local state and running this computation isn't free,
> in situations where there truly are remote nodes being communicated
> with, the network overhead is going to dwarf that. If there were a fast
> path for the simplest cases and this complicated one for the rest, I
> think you could get the fully programmable behavior some people want
> using simple SQL, rather than having to write a new "Replication
> Description Language" or something so ambitious. This data about what's
> been replicated to where looks to me an awful lot like a set of rows you
> can operate on using features already in the database.

I think we're all agreed on the 4 levels: async, recv, fsync, apply.

I also like the idea of a synchronisation/wakeup rule as an abstract
concept. It certainly makes things easier to discuss.

The inputs to the wakeup rule can be defined in different ways. Holding
per-node state at the local level looks too complex to me. I'm not
suggesting that we need both per-node AND per-transaction options
interacting at the same time. (That would be a clear argument against
per-transaction options if it were a requirement - it's not, for me.)

There seems to be a simpler way: a service-oriented model. The
transaction requests a minimum level of synchronisation, and the standbys
together service that request. A simple, clear process:

1. If a transaction requests recv, fsync or apply, the backend sleeps in
the appropriate queue.

2. An agent acting on behalf of the remote standby provides feedback
according to the levels of service defined for that standby.

3. The agent calls a wakeup rule to see if the backend can be woken yet.

The most basic rule is "first-standby-wakes", meaning that the rule fires
as soon as any one standby provides feedback that the required
synchronisation level has been met.
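
Sketched over Greg's illustrative table for concreteness (the names are
assumed, nothing here is real syntax), first-standby-wakes reduces to an
existence test:

  -- Hypothetical rule: fire as soon as any one standby reports
  -- the requested level or better.
  CREATE FUNCTION first_standby_wakes(required standby_state)
  RETURNS boolean AS $$
      SELECT EXISTS (SELECT 1 FROM standby_progress
                     WHERE state >= required);
  $$ LANGUAGE sql;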

The next most basic option is that some standbys can be marked as not
taking part in the quorum, so they are not capable of waking backends for
certain levels of request.

Rules can record intermediate data to allow us to wait until multiple
agents have provided feedback.

Another rule might be "wait for all standbys marked apply", though that
has complex behaviour in failure conditions.
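
Sketched the same way, and assuming a hypothetical per-standby
quorum_member flag for the marking described above:

  -- Hypothetical rule: fire only when every quorum-participating
  -- standby has confirmed apply.
  CREATE FUNCTION all_apply_rule() RETURNS boolean AS $$
      SELECT NOT EXISTS (SELECT 1 FROM standby_progress
                         WHERE quorum_member AND state < 'apply');
  $$ LANGUAGE sql;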

Other, more complex rules are possible. Those could be defined explicitly
with some DDL/RDL, via a plugin, or just hardcoded as options. That level
of complexity is secondary and can be added later; it becomes especially
easy once we have the concept of a synchronisation wakeup rule as Greg
describes here.

Being able to set the synchronisation level per transaction is very
important. All other systems provide only a single level of
synchronisation, which makes application design difficult. It will be a
compelling feature for application designers to be able to have their
reference data updated at APPLY, so it is always consistent everywhere,
while other important data is FSYNC and less critical data is ASYNC. It's
a real pain to have to partition applications because of their
synchronisation requirements.
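
For example, with some purely hypothetical per-transaction syntax (no
such parameter exists today; the table names are also invented):

  BEGIN;
  SET LOCAL synchronisation_level = 'apply';   -- hypothetical parameter
  UPDATE currency_rates SET rate = 1.23 WHERE code = 'EUR';
  COMMIT;

  BEGIN;
  SET LOCAL synchronisation_level = 'async';   -- less critical data
  INSERT INTO click_log VALUES (now(), '/page42');
  COMMIT;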

--
Simon Riggs www.2ndQuadrant.com
