partial replication ...

From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: pgsql-cluster-hackers(at)postgresql(dot)org, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: partial replication ...
Date: 2009-11-20 10:35:27
Message-ID:
Lists: pgsql-cluster-hackers
hello everybody,

we are glad that there is finally a list for clustering, and i hope we
can bring things forward here.
from my side we are facing the following situation: we got the funding
to do partial WAL-based replication, and we even proposed a concept for
it the week before the paris conference. the result was: no responses ...
so, we got the cash, we got the technical staff to do it, and we were
surprised that nobody dared to answer after all.
in other words: i would like to resubmit this proposal here and discuss
it so that we can finally move forward.
my plan is to make sure that there is "not one more island solution" and
that things have a real chance of going into core to meet the needs of so
many people out there.

so, here is an adapted version of the proposal i sent some time ago. i
would be happy if we could get a discussion going:

Partial WAL Replication for PostgreSQL:

As of now the PostgreSQL community has provided patches and functionality
which allow full WAL-based replication as well as hot standby. To extend this
functionality and to make PostgreSQL even more suitable for "enterprise"
computing than it is today, we have the commitment of a sponsor to fund
partial replication for PostgreSQL 8.5 / 8.6.

This is the first draft of a proposal to make partial WAL-based replication
work and to provide an additional set of fancy features to the community,
which has been waiting for real in-core replication for a decade or more.

Why partial replication?

In some cases people have master servers which contain enormous amounts of
data (XX TB or so). If more than just one replica of this data is needed, it
can happen that different slaves are used for different purposes. This implies
that not all data will be used by all machines.
An example: consider a server at a phone company collecting phone calls, data,
and maybe network routing data. The data is used by different departments, and
one machine is not enough to serve all three departments. With the new
functionality proposed here we could make 3 replicas, each holding just a
subset of the tables for specific tasks, thus allowing people to buy cheaper
hardware for the slaves and use more machines instead.

Current status:

Hot standby and streaming replication have been a huge leap forward for the
community, and what is proposed here will be an extension to those patches and
functionalities. This concept is NOT aimed at replacing any of those patches -
it is mainly an add-on.

Nodes and replication filters:

As of 8.4, standby systems are set up by creating an archive_command along
with a base backup. Although this is easy to do, some users have still
reported difficulties due to a total misunderstanding of PITR.

The idea is to add functionality to register slaves like this:

    CREATE REPLICA node_name
    CONNECT FROM SLAVE 'connect_string'
    TRANSFER COMMAND 'command'
    [ USING replication_filter ];

'command' would be any shell script copying data from the local master to the
new database node called node_name. Replication filters can be used to make X
replicas contain the same set of tables. Filter sets can be created like this:

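(the exact CREATE syntax is not carved in stone yet; by analogy with the
ALTER and DROP forms, a first sketch could be a filter that starts out empty
and is populated table by table later:)

    CREATE REPLICATION FILTER filter_name;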

Replication filters can be modified ...

ALTER REPLICATION FILTER filter_name RENAME TO new_filtername;

Filter sets can be dropped like this ...

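(again just a sketch, mirroring the CREATE form above:)

    DROP REPLICATION FILTER filter_name;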

Internally CREATE REPLICA would initiate a base backup to the new slave just
like we would do it manually otherwise. The server would automatically use the
user-defined 'command' to copy one file after the other to the slave. The idea
is basically stolen from archive_command and friends. At this stage we either
copy the entire instance, as we would do with a normal base backup, or just
what is needed (defined by the replication filter). Users would thus
automatically copy only the data to a slave which is really needed there and
matches their filter config. Once the copy is done, we can register the new
node inside a system table and commit the transaction. Also, we can
automatically create a useful recovery.conf setup - we know how to connect
from the slave to the master (we can use 'CONNECT FROM SLAVE [ USING ]' to
write a proper recovery.conf file).
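To make this concrete, here is what registering a filtered node could look
like; the node name, connection string, scp-based transfer command, and the
%p / %f placeholders (assumed to behave like the ones in archive_command) are
made-up examples only:

    CREATE REPLICA billing_slave
        CONNECT FROM SLAVE 'host=master.example.com port=5432 user=repl'
        TRANSFER COMMAND 'scp %p slave1.example.com:/data/%f'
        USING billing_filter;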

Tables can easily be added to or removed from a replication filter with ALTER
REPLICATION FILTER:
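(the ADD / DROP TABLE forms shown here are again only a syntax sketch:)

    ALTER REPLICATION FILTER filter_name ADD TABLE table_name;
    ALTER REPLICATION FILTER filter_name DROP TABLE table_name;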

Replicas can be removed easily:

DROP REPLICA node_name;

Why SQL to add a node? We are convinced that this is the simplest and most
intuitive way of doing things. We believe it gives users a real feeling of
simplicity. The current way of doing base backups should stay in place as it
is - it has proven to be nice for countless users. However, it is not suitable
for managing 10 or more replicas easily, especially not when they are not
full-blown copies of the master.

Technical ideas:

System tables:

We suggest always replicating the entire system catalog. It would be a total
disaster to try some other implementation. The same applies for other tables -
we always replicate entire tables; no WHERE clauses are allowed when it comes
to replicating any table.
How can a query on the slave figure out if a table is around? The slave just
needs to know "who it is". Then it can easily look up from the replication
filter it is using whether a table is actually physically in place or not. If
a table is not in place, we can easily error out.
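(a sketch of that lookup; pg_replica, pg_replication_filter_tables and
current_node_name() are purely hypothetical names for the catalogs and
function such an implementation would need:)

    SELECT EXISTS (
        SELECT 1
          FROM pg_replication_filter_tables ft
          JOIN pg_replica r ON r.filter_name = ft.filter_name
         WHERE r.node_name = current_node_name()  -- "who am i?"
           AND ft.table_name = 'invoices');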

Remove a table from the slave:

This is not too hard; the master receives the command to remove a table from
the slave. We will send a request to remove all storage files related to the
table and adjust the replication filter to make sure that the slave will not
replay the content of this table anymore.

Add a table to a slave:

This is slightly more tricky. We start collecting WAL for the table, stop WAL
replay, use the TRANSFER COMMAND to copy the files related to the table being
added, and resume recovery / sending once the storage files are on the slave.
What will work for one table will of course also work for a database or
whatever other set of data.
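(as a sketch, the user-facing side of this could simply be the ALTER form
from above, with the file shipping happening behind the scenes:)

    -- hypothetical: the master keeps collecting WAL for the table, pauses
    -- replay on the slave, ships the storage files via TRANSFER COMMAND,
    -- and resumes replay once they are in place
    ALTER REPLICATION FILTER billing_filter ADD TABLE invoices;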

Additional stuff:

Of course there are many more consistency considerations here. We cannot
replicate an index if the table is not present, etc.


We have some ideas of our own here, but at this stage we are most interested
in gathering community ideas on how other people would do it. I want to avoid
the usual "yes we want that but we hate this detail therefore it is shit" game
at this stage and instead come up with an overall concept.

btw: responses and comments of any kind are welcome.

    many thanks,


Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
