Quick Links

Re: [mail] Re: Big 7.4 items - Replication

From:	"Al Sutton" <al(at)alsutton(dot)com>
To:	"Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
Cc:	"Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [mail] Re: Big 7.4 items - Replication
Date:	2002-12-15 22:15:40
Message-ID:	001301c2a487$813b69d0$0100a8c0@cloud
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Jonathan,

Many thanks for clarifying the situation some more. With token passing, I
have the following concerns;

1) What happends if a server holding the token should die whilst it is in
posession of the token.

2) If I have n servers, and the time to pass the token between each server
is x milliseconds, I may have to wait for upto m times x milliseconds in
order for a transaction to be processed. If a server is limited to a single
transaction per posession of the token (in order to ensure no system hogs
the token), and the server develops a queue of length y, I will have to wait
m times x times y for the transaction to be processed. Both scenarios I
beleive would not scale well beyond a small subset of servers with low
network latency between them.

If we consider the following situation I can illustrate why I'm still in
favour of a two phase commit;

Imagine, for example, credit card details about the status of an account
replicated in real time between databases in London, Moscow, Singapore,
Syndey, and New York. If any server can talk to any other server with a
guarenteed packet transfer time of 150ms a two phase commit could complete
in 600ms as it's worst case (assuming that the two phases consist of
request/response pairs, and that each server talks to all the others in
parallel). A token passing system may have to wait for the token to pass
through every other server before reaching the one that has the transaction
comitted to it, which could take about 750ms.

If you then expand the network to allow for a primary and disaster recover
database at each location the two phase commit still maintains it's 600ms
response time, but the token passing system doubles to 1500ms.

Allowing disjointed segments to continue executing is also a concern because
any split in the replication group could effectively double the accepted
card limit for any card holder should they purchase items from various
locations around the globe.

I can see an idea that the token may be passed to the system with the most
transactions in a wait state, but this would cause low volume databases to
loose out on response times to higher volume ones, which is again,
undesirable.

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
To: "Al Sutton" <al(at)alsutton(dot)com>
Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>; "Bruce Momjian"
<pgman(at)candle(dot)pha(dot)pa(dot)us>; "Jan Wieck" <JanWieck(at)Yahoo(dot)com>;
<shridhar_daithankar(at)persistent(dot)co(dot)in>; "PostgreSQL-development"
<pgsql-hackers(at)postgresql(dot)org>
Sent: Sunday, December 15, 2002 9:17 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

> On Sun, Dec 15, 2002 at 07:42:35PM -0000, Al Sutton wrote:
> > Jonathan,
> >
> > How do the group communication daemons on system A and B agree that T2
is
> > after T1?,
>
> Lets split this into two separate problems:
>
> 1) How do the daemons totally order a set of messages (abstract
> messages)
>
> 2) How do database transactions get split into writesets that are sent
> as messages through the group communication system.
>
> As to question 1, the set of daemons (usually one running on each
> participating server) run a distributed ordering algorithm, as well as
> distributed algorithms to provide message reliability, fault-detection,
> and membership services. These are completely distributed algorithms, no
> "central" controller node exists, so even if network partitions occur
> the group communication system keeps running and providing ordering and
> reliability guarantees to messages.
>
> A number of different algorithms exist as to how to provide a total
> order on messages. Spread currently uses a token algorithm, that
> involves passing a token between the daemons, and a counter attached to
> each message, but other algorithms exist and we have implemneted some
> other ones in our research. You can find lots of details in the papers
> at www.cnds.jhu.edu/publications/ and www.spread.org.
>
> As to question 2, there are several different approaches to how to use
> such a total order for actual database replication. They all use the gcs
> total order to establish a single sequence of "events" that all the
> databases see. Then each database can act on the events as they are
> delivered by teh gcs and be guaranteed that no other database will see a
> different order.
>
> In the postgres-R case, the action received from a client is performned
> partially at the originating postgres server, the writesets are then
> sent through the gcs to order them and determine conflicts. Once they
> are delivered back, if no conflicts occured in the meantime, the
> original transaction is completed and the result returned to the client.
> If a conflict occured, the original transaction is rolled back and
> aborted. and the abort is returned to the client.
>
> >
> > As I understand it the operation is performed locally before being
passed on
> > to the group for replication, when T2 arrives at system B, system B has
no
> > knowlege of T1 and so can perform T2 sucessfully.
> >
> > I am guessing that the System B performs T2 locally, sends it to the
group
> > communication daemon for ordering, and then receives it back from the
group
> > communication order queue after it's position in the order queue has
been
> > decided before it is written to the database.
>
> If I understand the above correctly, yes, that is the same as I describe
> above.
>
> >
> > This would indicate to me that there is a single central point which
decides
> > that T2 is after T1.
>
> No, there is a distributed algorithm that determins the order. The
> distributed algorithm "emulates" a central controller who decides the
> order, but no single controller actually exists.
>
> Jonathan
>
> > ----- Original Message -----
> > From: "Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
> > To: "Al Sutton" <al(at)alsutton(dot)com>
> > Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>; "Bruce Momjian"
> > <pgman(at)candle(dot)pha(dot)pa(dot)us>; "Jan Wieck" <JanWieck(at)Yahoo(dot)com>;
> > <shridhar_daithankar(at)persistent(dot)co(dot)in>; "PostgreSQL-development"
> > <pgsql-hackers(at)postgresql(dot)org>
> > Sent: Sunday, December 15, 2002 5:00 PM
> > Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication
> >
> >
> > > The total order provided by the group communication daemons guarantees
> > > that every member will see the tranactions/writesets in the same
order.
> > > So both A and B will see that T1 is ordered before T2 BEFORE writing
> > > anything back to the client. So for both servers T1 will be completed
> > > successfully, and T2 will be aborted because of conflicting writesets.
> > >
> > > Jonathan
> > >
> > > On Sun, Dec 15, 2002 at 10:16:22AM -0000, Al Sutton wrote:
> > > > Many thanks for the explanation. Could you explain to me where the
order
> > or
> > > > the writeset for the following scenario;
> > > >
> > > > If a tranasction takes 50ms to reach one database from another, for
a
> > > > specific data element (called X), the following timeline occurs
> > > >
> > > > at 0ms, T1(X) is written to system A.
> > > > at 10ms, T2(X) is written to system B.
> > > >
> > > > Where T1(X) and T2(X) conflict.
> > > >
> > > > My concern is that if the Group Communication Daemon (gcd) is
operating
> > on
> > > > each database, a successful result for T1(X) will returned to the
> > client
> > > > talking to database A because T2(X) has not reached it, and thus no
> > conflict
> > > > is known about, and a sucessful result is returned to the client
> > submitting
> > > > T2(X) to database B because it is not aware of T1(X). This would
mean
> > that
> > > > the two clients beleive bothe T1(X) and T2(X) completed succesfully,
yet
> > > > they can not due to the conflict.
> > > >
> > > > Thanks,
> > > >
> > > > Al.
> > > >
> > > > ----- Original Message -----
> > > > From: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>
> > > > To: "Al Sutton" <al(at)alsutton(dot)com>
> > > > Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>; "Jan Wieck"
> > > > <JanWieck(at)Yahoo(dot)com>; <shridhar_daithankar(at)persistent(dot)co(dot)in>;
> > > > "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
> > > > Sent: Saturday, December 14, 2002 6:48 PM
> > > > Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >b) The Group Communication blob will consist of a number of
processes
> > > > which
> > > > > >need to talk to all of the others to interrogate them for changes
> > which
> > > > may
> > > > > >conflict with the current write that being handled and then issue
the
> > > > > >transaction response. This is basically the two phase commit
solution
> > > > with
> > > > > >phases moved into the group communication process.
> > > > > >
> > > > > >I can see the possibility of using solution b and having less
group
> > > > > >communication processes than databases as attempt to simplify
things,
> > but
> > > > > >this would mean the loss of a number of databases if the machine
> > running
> > > > the
> > > > > >group communication process for the set of databases is lost.
> > > > > >
> > > > > The group communication system doesn't just run on one system.
For
> > > > > postgres-r using spread
> > > > > there is actually a spread daemon that runs on each database
server.
> > It
> > > > > has nothing to do with
> > > > > detecting the conflicts. Its job is to deliver messages in a
total
> > > > > order for writesets or simple order
> > > > > for commits, aborts, joins, etc.
> > > > >
> > > > > The detection of conflicts will be done at the database level, by
a
> > > > > backend processes. The basic
> > > > > concept is "if all databases get the writesets (changes) in the
exact
> > > > > same order, apply them in a
> > > > > consistent order, avoid conflicts, then one copy serialization is
> > > > > achieved. (one copy of the database
> > > > > replicated across all databases in the replica)
> > > > >
> > > > > I hope that explains the group communication system's
responsibility.
> > > > >
> > > > > Darren
> > > > >
> > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------(end of
> > broadcast)---------------------------
> > > > > TIP 5: Have you checked our extensive FAQ?
> > > > >
> > > > > http://www.postgresql.org/users-lounge/docs/faq.html
> > > >
> > > >
> > > >
> > > > ---------------------------(end of
broadcast)---------------------------
> > > > TIP 6: Have you searched our list archives?
> > > >
> > > > http://archives.postgresql.org
> > >
> > > --
> > > -------------------------------------------------------
> > > Jonathan R. Stanton jonathan(at)cs(dot)jhu(dot)edu
> > > Dept. of Computer Science
> > > Johns Hopkins University
> > > -------------------------------------------------------
> > >
> >
> >
>
> --
> -------------------------------------------------------
> Jonathan R. Stanton jonathan(at)cs(dot)jhu(dot)edu
> Dept. of Computer Science
> Johns Hopkins University
> -------------------------------------------------------
>

In response to

Re: [mail] Re: Big 7.4 items - Replication at 2002-12-15 19:42:35 from Al Sutton

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ian Barwick	2002-12-15 23:36:26	Re: Patch for DBD::Pg pg_relcheck problem
Previous Message	Jeroen T. Vermeulen	2002-12-15 22:08:49	Re: PQnotifies() in 7.3 broken?