Re: Replication Ideas

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Chris Travers <chris(at)travelamericas(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Replication Ideas
Date: 2003-08-27 01:43:10
Message-ID: 3F4C0CAE.3030901@Yahoo.com
Lists: pgsql-general pgsql-hackers pgsql-performance

WARNING: This is getting long ...

Postgres-R is a very interesting and inspiring idea. And I've been
kicking that concept around for a while now. What I don't like about it
is that it requires fundamental changes to the lock mechanism and that
it is based on the assumption of very low lock conflict.

<explain-PG-R>
In Postgres-R a committing transaction sends its workset (WS - a list
of all updates done in this transaction) to the group communication
system (GC). The GC guarantees total order, meaning that all nodes will
receive all WSs in the same order, no matter in which order they were sent.

If a node receives back its own WS before any error occurred, it goes
ahead and finalizes the commit. If it receives a foreign WS, it has to
apply the whole WS and commit it before it can process anything else. If
a local transaction, still in progress or waiting for its own WS to
come back, holds a lock that is required to process such a remote WS,
the local transaction needs to be aborted to release its resources ...
it lost the total order race.
</explain-PG-R>
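
To make that rule concrete, here is a toy sketch in Python (all names
are made up; this is my reading of the concept, not Postgres-R code,
and a real system must also propagate the abort of a loser transaction
so the other nodes discard its WS, which I hand-wave away here):

    from dataclasses import dataclass, field

    @dataclass
    class Workset:
        txn_id: str                               # transaction that produced this WS
        origin: str                               # node the transaction ran on
        locks: set = field(default_factory=set)   # rows the WS touches

    class Node:
        def __init__(self, name):
            self.name = name
            self.local_pending = {}   # txn_id -> WS, sent, waiting for total order

        def begin_commit(self, ws):
            """Local txn sends its WS to the GC and waits for it to come back."""
            self.local_pending[ws.txn_id] = ws

        def deliver(self, ws):
            """Called for every WS in the GC's total order, on every node."""
            if ws.origin == self.name:
                if ws.txn_id in self.local_pending:
                    # Our own WS came back first: we won the race.
                    del self.local_pending[ws.txn_id]
                    print(f"{self.name}: COMMIT local {ws.txn_id}")
                else:
                    print(f"{self.name}: discard WS of aborted {ws.txn_id}")
                return
            # Foreign WS: abort every local txn holding a conflicting lock,
            # then apply and commit the remote WS before anything else.
            for txn_id, local in list(self.local_pending.items()):
                if local.locks & ws.locks:
                    del self.local_pending[txn_id]
                    print(f"{self.name}: ABORT local {txn_id} (lost the race)")
            print(f"{self.name}: APPLY+COMMIT remote {ws.txn_id}")

    # Two nodes update the same row; the GC happens to order n2's WS first.
    n1, n2 = Node("n1"), Node("n2")
    ws1 = Workset("T1", "n1", {"row42"})
    ws2 = Workset("T2", "n2", {"row42"})
    n1.begin_commit(ws1); n2.begin_commit(ws2)
    for ws in [ws2, ws1]:             # the same total order on every node
        n1.deliver(ws); n2.deliver(ws)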

Postgres-R requires that all remote WSs are applied and committed before
a local transaction can commit. Otherwise it couldn't correctly detect a
lock conflict, so there will not be any read-ahead. And since the total
order really counts here, it cannot apply any two remote WSs in
parallel: a race condition could let a later WS in the total order run
faster and lock out an earlier one. So we have to squeeze all remote WSs
through one single replication worker process. And all the locally
parallel executed transactions that wait for their WSs to come back have
to wait until that poor little worker is done with the whole pile. Bye
bye, concurrency. And I don't know how the GC will deal with the backlog
either. It could well choke on it.
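
The funnel, as a toy model (illustrative only; the sleep stands in for
applying a whole WS):

    import queue, threading, time

    remote_ws = queue.Queue()     # WSs in total order, as delivered by the GC

    def replication_worker():
        # The single worker: apply each remote WS to completion, one at a
        # time. Starting WS n+1 before WS n is done could let the later WS
        # grab a lock the earlier one needs, breaking the total order.
        while True:
            ws = remote_ws.get()
            if ws is None:
                break
            time.sleep(0.01)      # stand-in for applying all updates in the WS
            remote_ws.task_done()

    t = threading.Thread(target=replication_worker)
    t.start()
    for i in range(100):          # backlog from many concurrent masters
        remote_ws.put(f"WS-{i}")
    remote_ws.join()              # every local commit waits behind this pile
    remote_ws.put(None)
    t.join()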

I do not see how this will scale well in a multi-SMP-system cluster. At
the least, the serialization of WSs will become a horror if there is
significant lock contention, like in a standard TPC-C on the district
row containing the order number counter. I don't know for sure, but I
suspect that with this kind of bottleneck, Postgres-R will have to roll
back more than 50% of its transactions when there are more than 4
nodes under heavy load (like in a benchmark run). That will suck ...

But ... initially I said that it is an inspiring concept ... soooo ...

I am currently hacking around with some C+PL/TclU+Spread constructs that
might form a crude kind of prototype creature.

My changes to the Postgres-R concept are that there will be as many
replicating slave processes as there are masters in total out in the
cluster ... yes, it will try to utilize all the CPUs in the cluster!

For failover reliability, a committing transaction will hold before
finalizing the commit and send its "I'm ready" to the GC. Every
replicator that reaches the same state sends "I'm ready" too. Spread
guarantees in SAFE_MESS mode that messages are delivered to all nodes in
a group, or that at least LEAVE/DISCONNECT messages are delivered
before. So once a node has received "I'm ready" from more than 50% of
the nodes, there is only a very small gap in which multiple nodes would
have to fail in the same split second for the majority of nodes NOT to
commit. A node that reported "I'm ready" but lost more than 50% of the
cluster before committing has to roll back and rejoin, or wait for
operator intervention.
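
The decision rule a node would apply, as a toy sketch (made-up names;
this only models the counting, not Spread itself):

    class TxnVote:
        def __init__(self, cluster_size):
            self.cluster_size = cluster_size
            self.ready = set()         # nodes that reported "I'm ready"
            self.alive = cluster_size  # current membership count

        def on_ready(self, node):
            self.ready.add(node)

        def on_leave(self):
            self.alive -= 1

        def decide(self, reported_ready):
            majority = self.cluster_size // 2 + 1
            if len(self.ready) >= majority:
                return "COMMIT"        # the majority will commit
            if reported_ready and self.alive < majority:
                # We said "I'm ready" but lost contact with most of the
                # cluster before committing: roll back and rejoin, or
                # wait for operator intervention.
                return "ROLLBACK_AND_REJOIN"
            return "WAIT"

    v = TxnVote(cluster_size=5)
    for n in ("n1", "n2", "n3"):
        v.on_ready(n)
    print(v.decide(reported_ready=True))   # -> COMMIT (3 of 5 are ready)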

Now the idea is to split the communication up into per-transaction GC
distribution groups. The working master backend and its associated
replication backends will join a unique group for every transaction in
the cluster and leave it again when that transaction is done. This way,
the per-process communication is reduced to the required minimum.
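
Modeled naively (this is not the Spread API; with Spread it would map
onto SP_join/SP_leave/SP_multicast with a group name derived from the
transaction id):

    class GroupComm:
        def __init__(self):
            self.groups = {}                  # group name -> set of members

        def join(self, group, member):
            self.groups.setdefault(group, set()).add(member)

        def leave(self, group, member):
            self.groups[group].discard(member)
            if not self.groups[group]:
                del self.groups[group]        # group vanishes with the txn

        def multicast(self, group, msg):
            for m in self.groups.get(group, ()):
                print(f"deliver to {m}: {msg}")

    gc = GroupComm()
    group = "txn-000187"                      # unique group per transaction
    gc.join(group, "master-backend@n1")
    gc.join(group, "replicator@n2")
    gc.join(group, "replicator@n3")
    gc.multicast(group, "WS for txn 187")     # only these three see it
    for m in list(gc.groups[group]):
        gc.leave(group, m)                    # commit done, group dissolves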

As said, I am hacking on some code ...

Jan

Chris Travers wrote:
> Tom Lane wrote:
>
>>Chris Travers <chris(at)travelamericas(dot)com> writes:
>>
>>
>>>Yes I have. Postgres-r is not a high-availability solution which is
>>>capable of transparent failover,
>>
>>What makes you say that? My understanding is it's supposed to survive
>>loss of individual servers.
>>
>> regards, tom lane
>
> My mistake. I must have gotten them confused with another
> (asynchronous) replication project.
>
> Best Wishes,
> Chris Travers

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                 #
#================================================== JanWieck(at)Yahoo(dot)com #
