partial replication ...

From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: pgsql-cluster-hackers(at)postgresql(dot)org, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: partial replication ...
Date: 2009-11-20 10:35:27
Message-ID:
Lists: pgsql-cluster-hackers
hello everybody,

we are glad that there is finally a list for clustering, and i hope we
can bring things forward here.
from my side we are facing the following situation: we got the funding
to do partial WAL-based replication, and we even proposed a concept for
it the week before the paris conference. the result was: no responses ...
so, we got the cash, we got the technical staff to do it, and we were
surprised that nobody dared to answer after all.
in other words: i would like to resubmit this proposal here and discuss
it so that we can finally move forward.
my plan is to make sure that there is "not one more island solution" and
that things have a real chance of going into core to meet the needs of so
many people out there.

so, here is an adapted version of the proposal i sent some time ago. i
would be happy if we could get a discussion going:

Partial WAL Replication for PostgreSQL:

As of now the PostgreSQL community has provided patches and functionality
which allow full WAL-based replication as well as hot standby. To extend this
functionality and to make PostgreSQL even more suitable for "enterprise"
computing than it is today, we have the commitment of a sponsor to fund
partial replication for PostgreSQL 8.5 / 8.6.

This is the first draft of a proposal to make partial WAL-based replication
work and to provide an additional set of fancy features to the community,
which has been waiting for real in-core replication for a decade or more.

Why partial replication?

In some cases people have master servers which contain enormous amounts of
data (XX TB or so). If more than just one replica of this data is needed, it
can happen that different slaves are used for different purposes. This implies
that not all data will be used by all machines.
An example: consider a server at a phone company collecting phone calls, data,
and maybe network routing data. The data is used by different departments, and
one machine is not enough to serve all three departments. With the new
functionality proposed here we could make 3 replicas, each holding just a
subset of the tables for specific tasks, thus allowing people to buy cheaper
hardware for the slaves and use more machines instead.

Current status:

Hot standby and streaming replication have been a huge leap forward for the
community, and what is proposed here will be an extension to those patches and
functionalities. This concept is NOT aimed at replacing any of those patches -
it is mainly an add-on.

Nodes and replication filters:

As of 8.4, standby systems are set up by creating an archive_command along
with a base backup. Although this is easy to do, some users have still
reported difficulties due to a total misunderstanding of PITR.

The idea is to add functionality to register slaves like this:

    CREATE REPLICA node_name
    CONNECT FROM SLAVE 'connect_string'
    TRANSFER COMMAND 'command'
    [ USING replication_filter ];

'command' would be any shell script copying data from the local master to the
new database node called node_name. Replication filters can be used to make X
replicas contain the same set of tables. Filter sets can be created like this:

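(the exact CREATE syntax is not carved in stone yet; by analogy with the
ALTER and DROP forms, a first sketch could be a filter that starts out empty
and is populated table by table later:)

    CREATE REPLICATION FILTER filter_name;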

Replication filters can be modified ...

ALTER REPLICATION FILTER filter_name RENAME TO new_filtername;

Filter sets can be dropped like this ...

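(again just a sketch, mirroring the CREATE form above:)

    DROP REPLICATION FILTER filter_name;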

Internally CREATE REPLICA would initiate a base backup to the new slave just
like we would do it manually otherwise. The server would automatically use the
user-defined 'command' to copy one file after the other to the slave. The idea
is basically stolen from archive_command and friends. At this stage we either
copy the entire instance, as we would do with a normal base backup, or just
what is needed (defined by the replication filter). Users would thus
automatically copy only the data to a slave which is really needed there and
matches their filter config. Once the copy is done, we can register the new
node inside a system table and commit the transaction. Also, we can
automatically create a useful recovery.conf setup - we know how to connect
from the slave to the master (we can use 'CONNECT FROM SLAVE [ USING ]' to
write a proper recovery.conf file).
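To make this concrete, here is what registering a filtered node could look
like; the node name, connection string, scp-based transfer command, and the
%p / %f placeholders (assumed to behave like the ones in archive_command) are
made-up examples only:

    CREATE REPLICA billing_slave
        CONNECT FROM SLAVE 'host=master.example.com port=5432 user=repl'
        TRANSFER COMMAND 'scp %p slave1.example.com:/data/%f'
        USING billing_filter;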

Tables can easily be added to or removed from a replication filter with ALTER
REPLICATION FILTER:
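(the ADD / DROP TABLE forms shown here are again only a syntax sketch:)

    ALTER REPLICATION FILTER filter_name ADD TABLE table_name;
    ALTER REPLICATION FILTER filter_name DROP TABLE table_name;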

Replicas can be removed easily:

DROP REPLICA node_name;

Why SQL to add a node? We are convinced that this is the simplest and most
intuitive way of doing things. We believe it gives users a real feeling of
simplicity. The current way of doing base backups should stay in place as it
is - it has proven to be nice for countless users. However, it is not suitable
for managing 10 or more replicas easily, especially not when they are not
full-blown copies of the master.

Technical ideas:

System tables:

We suggest always replicating the entire system catalog. It would be a total
disaster to try some other implementation. The same applies for other tables -
we always replicate entire tables; no WHERE clauses are allowed when it comes
to replicating any table.
How can a query on the slave figure out if a table is around? The slave just
needs to know "who it is". Then it can easily look up from the replication
filter it is using whether a table is actually physically in place or not. If
a table is not in place, we can easily error out.
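(a sketch of that lookup; pg_replica, pg_replication_filter_tables and
current_node_name() are purely hypothetical names for the catalogs and
function such an implementation would need:)

    SELECT EXISTS (
        SELECT 1
          FROM pg_replication_filter_tables ft
          JOIN pg_replica r ON r.filter_name = ft.filter_name
         WHERE r.node_name = current_node_name()  -- "who am i?"
           AND ft.table_name = 'invoices');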

Remove a table from the slave:

This is not too hard; the master receives the command to remove a table from
the slave. We will send a request to remove all storage files related to the
table and adjust the replication filter to make sure that the slave will not
replay the content of this table anymore.

Add a table to a slave:

This is slightly more tricky. We start collecting WAL for the table, stop WAL
replay, use the TRANSFER COMMAND to copy the files related to the table being
added, and resume recovery / sending once the storage files are on the slave.
What will work for one table will of course also work for a database or
whatever other set of data.
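(as a sketch, the user-facing side of this could simply be the ALTER form
from above, with the file shipping happening behind the scenes:)

    -- hypothetical: the master keeps collecting WAL for the table, pauses
    -- replay on the slave, ships the storage files via TRANSFER COMMAND,
    -- and resumes replay once they are in place
    ALTER REPLICATION FILTER billing_filter ADD TABLE invoices;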

Additional stuff:

Of course there are many more consistency considerations here. We cannot
replicate an index if the table is not present, etc.


We have some ideas of our own here, but at this stage we are most interested
in gathering community ideas on how other people would do it. I want to avoid
the usual "yes we want that but we hate this detail therefore it is shit" game
at this stage and instead come up with an overall concept.

btw: responses and comments of any kind are welcome.

    many thanks,


Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
