From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: pgsql-hackers(at)postgresql(dot)org
Cc: hs(at)cybertec(dot)at, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: draft RFC: concept for partial, wal-based replication
Date: 2009-11-29 19:56:41
Message-ID: 4B12D1F9.3010607@cybertec.at
Lists: pgsql-hackers

Hi,

we tried to discuss, on a lower level, what would be needed for
partial replication based on streaming replication. The pieces are:

a. transferring base data after a relation/index/... is added to a slave
(and for the initial setup)
b. communicating to the slaves which relations they should currently
have available
c. splitting wal into different replication sets
d. some configuration frontend, possibly directly via SQL or via functions

I don't think it's reasonable to try to discuss and implement all of this
in one huge patch, so I propose implementing at least a) as a separate
patch.

a) is very useful outside the context of this specific feature and is
kind of a prerequisite, so I suggest tackling it first.

Plan (M = master, S = slave):
M: checkpoint, full page writes, access share lock on the relation
S: stop receiving wal
SM: Using $method transfer required base data for every required segment/fork
M: normal writes
S: restart wal replay

Do you see any fundamental problem with this?

Now there are, unfortunately, two different schools of thought on how to implement transferring base data.
1. an archive_command-like transfer command taking a source path/host and a target path/host
- very flexible (think e.g. of not transferring the data for multiple slaves separately across the whole country)
- harder to set up
- more in the style of classical wal archiving
2. adding the capability to the WAL streaming patch's libpq-based protocol
- no additional configuration needed
- inflexible
- makes usage from non-streaming replication impossible

I favor 1., but only slightly.
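
To make 1. a bit more concrete, here is a very rough sketch of how the
master could run such a command, in the style of the existing
archive_command handling. The GUC name transfer_command and the %f/%h
placeholders are made up for illustration only:

#include <stdlib.h>
#include <string.h>

/*
 * Substitute %f (file to transfer) and %h (target host) in the
 * hypothetical transfer_command and run it, archive_command-style.
 * Returns the exit status of the command, or -1 if the command
 * does not fit into the buffer.
 */
static int
run_transfer_command(const char *transfer_command,
                     const char *path, const char *target_host)
{
    char        cmd[1024];
    size_t      used = 0;
    const char *sp;

    for (sp = transfer_command; *sp; sp++)
    {
        const char *repl = NULL;

        if (sp[0] == '%' && sp[1] == 'f')
            repl = path;
        else if (sp[0] == '%' && sp[1] == 'h')
            repl = target_host;

        if (repl != NULL)
        {
            size_t      len = strlen(repl);

            if (used + len >= sizeof(cmd))
                return -1;
            memcpy(cmd + used, repl, len);
            used += len;
            sp++;               /* skip the placeholder character */
        }
        else
        {
            if (used + 1 >= sizeof(cmd))
                return -1;
            cmd[used++] = *sp;
        }
    }
    cmd[used] = '\0';

    /* as with archive_command, a non-zero exit status means failure */
    return system(cmd);
}

The real thing would of course live next to the existing archive_command
code; this is only to show the shape of option 1.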

Detail Questions:
- How to deal with multiple transfer requests at the same time?
  There could be full backup requests for individual tables from
  several clients at once. Currently pg_start_backup() isn't allowed
  from two clients in parallel; the second one gets an error.
  We thought that pg_start_backup() and pg_stop_backup() could turn
  into simple reference counts (rough sketch below this list). IIRC,
  WAL is still generated and _shipped to slaves_ during a full backup,
  it is simply not yet applied to the base table files. So in this
  case a pg_stop_backup() issued from a slave decreases the refcount
  of the base backups, and the slave can simply resume applying its
  newly received WAL to the base files.
- Keep track of the current number of transfers
  We would also need a way to query the refcount of the base backup,
  so that if a slave dies, the master can be recovered manually and
  applying the leftover WAL can resume.
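
To illustrate the refcount idea above: a minimal sketch, assuming a
small shared-memory struct protected by a spinlock. BackupRefCount,
backup_count and the function names are made up, not existing symbols:

#include "postgres.h"
#include "storage/spin.h"

typedef struct BackupRefCount
{
    slock_t     mutex;          /* protects backup_count */
    int         backup_count;   /* number of base backups in progress */
} BackupRefCount;

/* would live in shared memory, spinlock initialized at startup */
static BackupRefCount *BackupShared;

/* called from a refcounting pg_start_backup() */
static void
backup_ref(void)
{
    SpinLockAcquire(&BackupShared->mutex);
    if (BackupShared->backup_count++ == 0)
    {
        /* first backup: force full page writes, as pg_start_backup() does now */
    }
    SpinLockRelease(&BackupShared->mutex);
}

/* called from a refcounting pg_stop_backup() */
static void
backup_unref(void)
{
    SpinLockAcquire(&BackupShared->mutex);
    Assert(BackupShared->backup_count > 0);
    if (--BackupShared->backup_count == 0)
    {
        /* last backup finished: return to normal operation */
    }
    SpinLockRelease(&BackupShared->mutex);
}

The query function for the second point would simply read backup_count
under the same lock, and an administrator could decrement it manually if
a slave died mid-transfer.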

b) The slave needs to know when a relation has been added to its
replication set, in order to request a base backup of the relevant files.

I would suggest adding new wal record types for this:
- DROP_RELATION_SLAVE(node_id, relation)
- ADD_RELATION_SLAVE(node_id, relation)
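
Roughly along these lines; the struct name, the info codes and the use of
a RelFileNode are assumptions, just to show what the record body could carry:

#include "postgres.h"
#include "storage/relfilenode.h"

/* hypothetical info codes for the new record types */
#define XLOG_ADD_RELATION_SLAVE     0x00
#define XLOG_DROP_RELATION_SLAVE    0x10

typedef struct xl_relation_slave
{
    Oid         node_id;        /* slave node, see pg_node below */
    RelFileNode rnode;          /* relation the slave has to (un)track */
} xl_relation_slave;

Since this goes through wal, it is crash safe and also works for classic,
non-streaming replay.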

We thought about two ways of administering the replication set:
- slaves with full replication, optionally and explicitly excluding
relations
- slaves with minimal replication, with explicitly included relations
The above WAL record types will be needed in both cases,
but we also need two new catalog tables:
- the slave nodes, indicating the type of the slave as above
- explicit table entries (treated depending on the type of the slave)

Any out-of-band communication has severe problems with
crashes/unavailability of the slave, or rules out classic,
non-streaming wal replication.

Questions:
- How to deal with the fact that the slave may be unavailable while
something is being added to its replication set?
 - Possibly forbid all DDL on the table until the slave has received the update

c)
What to filter:
Every slave gets a node_id, which is assigned in a system catalog on the master:
pg_node(nodeoid) (per cluster)
and another catalog contains all replicated relations:
pg_replication_set(nodeoid, classoid, acknowledged_on_slave) (per database)
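
Very roughly, in genbki-header style; the OIDs 9000/9001 and the exact
columns are placeholders, not proposals for reserved values:

#include "catalog/genbki.h"

/* per cluster: one row per slave node; nodeoid is the row's OID */
CATALOG(pg_node,9000) BKI_SHARED_RELATION
{
    char        nodetype;       /* full replication or minimal replication */
} FormData_pg_node;

typedef FormData_pg_node *Form_pg_node;

/* per database: relations explicitly included in/excluded from a node's set */
CATALOG(pg_replication_set,9001)
{
    Oid         nodeoid;                /* references pg_node */
    Oid         classoid;               /* references pg_class */
    bool        acknowledged_on_slave;  /* slave has the base data */
} FormData_pg_replication_set;

typedef FormData_pg_replication_set *Form_pg_replication_set;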

How to filter:
Heap2, Heap, Btree, Gin, Gist, Storage and Sequence records need to be filtered by database/relation.
I am not nearly familiar enough with the relevant code yet to see whether it is feasible and worthwhile at all to filter clog, transaction and multixact records per database.

Where to filter:
I propose doing so in the walsender. While this would prohibit using a
classical wal-based standby, I do not see a big problem in that.

If done via wal streaming, it would be a simple addition of a node_id in PQstartXLogStreaming. This id obviously should not be resettable to something else...
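
To make the filtering a bit more concrete, this is roughly the per-record
check I imagine a walsender hook doing before shipping a record to a node.
extract_target_relfilenode() and node_wants_relation() are invented helpers
(in reality the RelFileNode has to be pulled out of each rmgr's specific
record layout), and mapping "Storage" to RM_SMGR_ID is my assumption:

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xlog.h"
#include "storage/relfilenode.h"

/* invented helpers, only to show the shape of the check */
bool extract_target_relfilenode(XLogRecord *record, RelFileNode *rnode);
bool node_wants_relation(Oid node_id, RelFileNode *rnode);

static bool
record_wanted_by_node(Oid node_id, XLogRecord *record)
{
    RelFileNode rnode;

    switch (record->xl_rmid)
    {
        case RM_HEAP_ID:
        case RM_HEAP2_ID:
        case RM_BTREE_ID:
        case RM_GIN_ID:
        case RM_GIST_ID:
        case RM_SEQ_ID:
        case RM_SMGR_ID:        /* "Storage" above */
            if (!extract_target_relfilenode(record, &rnode))
                return true;    /* when in doubt, ship the record */
            return node_wants_relation(node_id, &rnode);

        default:
            /* xlog, xact, clog, multixact, ...: always replicate */
            return true;
    }
}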

Questions:
- How to deal with access to the different per-database catalogs?
- Storing that data cluster-wide seems really ugly.
- read the code...

d) I do not have any strong or even moderate opinions about this. I think it's sensible to get something prototype-ish and function-based done before deciding on the real interface and getting into syntax wars.

Steps:

1:
- Transfer of Relations during runtime
- needed to use wal-splitting
- internally:
- Possibility 1:
- transfer_command = ... %filename%
- should not error out if data changes beneath it.
- called for every file (i.e. every fork and segment)
- how to deal with new segments?
- lock out writes after transferring the last segment and recheck?
- Possibility 2:
- Own networking protocol
- Easier to set up
- less flexible
- more work
- take a heavy lock from the beginning - sounds annoying but could easily be dealt with later
- How to transfer:
- stop wal replay on requesting side
- checkpoint on master
- start streaming wal for relation
- pg_start_backup() equiv
- transfer relation data
- pg_stop_backup() equiv.
- restart wal replay on relevant nodes

2:
- wal splitting for Heap2, Heap, Btree, Hash, Gin, Gist
- all others either have to be replicated anyway or are semantically a bit more complex (sequences)
- lower level does not care about dependencies?
- Additional system catalogs:
pg_node(nodeoid)
pg_replication_set(nodeoid, classoid)
- Using a hook in walsender - correct place, right?
- Extending the streaming protocol to transfer the node id from the slave
- 'slave_node_id' in PQstartXLogStreaming
- A slave errors out if a relation gets accessed which is not
replicated to it
- There is some need for more communication for some of the actions:
- remove table/index/... from replication set
- add table/index/... to replication set
- both require action from the slave: stop replaying wal, request base data of *
- both require action from the master: lock relation shared, checkpoint, stop wal reusage, full_page_writes (rough sketch after this list)
- could be started from the slave
- How should master/slave communicate
- extra wal type
- crash safe
- PQgetXLogData returns -3, user requests actions from master
- how to handle crashes?
- possibly an 'acknowledged' flag in pg_replication_set?
- both are not exactly nice and race-free
- Add to replication set, drop table before slave has caught up
- normal locking does not work (shutdown)
- using something around prepared transactions seems ugly
- synchronous operation?
- alter/drop table is only possible after acknowledgement from client
- wal type seems nicer
- possibly allowing it to be done via "classical" wal replay
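
For the master-side actions mentioned above (lock, checkpoint, full page
writes, announce via wal), the rough shape could be something like the
following. RM_REPLSET_ID is only a placeholder for a new rmgr id,
add_relation_to_slave() is invented, and forcing full page writes /
stopping wal reuse is only hinted at in a comment:

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xlog.h"
#include "postmaster/bgwriter.h"
#include "storage/buf.h"
#include "storage/lmgr.h"
#include "storage/lock.h"
#include "storage/relfilenode.h"

/* placeholders: a new rmgr id and info code would have to be assigned */
#define RM_REPLSET_ID               RM_MAX_ID
#define XLOG_ADD_RELATION_SLAVE     0x00

/* the same record body as sketched under b) above */
typedef struct xl_relation_slave
{
    Oid         node_id;
    RelFileNode rnode;
} xl_relation_slave;

static void
add_relation_to_slave(Oid node_id, Oid relid, RelFileNode rnode)
{
    xl_relation_slave xlrec;
    XLogRecData rdata;

    /* keep the relation stable while the slave fetches its base data */
    LockRelationOid(relid, AccessShareLock);

    /* give the slave a consistent on-disk starting point */
    RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);

    /*
     * Here: force full page writes and stop wal segment reuse, much like
     * pg_start_backup() does.
     */

    /* announce the change through wal, so it is crash safe and also
     * reaches classic, non-streaming standbys */
    xlrec.node_id = node_id;
    xlrec.rnode = rnode;

    rdata.data = (char *) &xlrec;
    rdata.len = sizeof(xlrec);
    rdata.buffer = InvalidBuffer;
    rdata.next = NULL;

    (void) XLogInsert(RM_REPLSET_ID, XLOG_ADD_RELATION_SLAVE, &rdata);

    /* simplified: in reality the lock would have to be held until the
     * slave has acknowledged the base data transfer */
    UnlockRelationOid(relid, AccessShareLock);
}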

We would not build any bells and whistles around this for now;
it would all work as function calls.
Later one could build an SQL-ish interface.

Hans-Juergen Schoenig -- PostgreSQL írta:
> hello ...
>
> as my day has worked out quite nicely up to now I thought to f... it
> up and post a new concept which has been requested by a customer. the
> goal is to collect some feedback, ideas and so on (not to be mixed up
> with "flames").
> we have funding for this and we are trying to sort out how to do it
> the best way. comments are welcome ...
> note, this is a first draft i want to refine based on some comments.
> here we go ...
>
>
> Partial WAL Replication for PostgreSQL:
> ---------------------------------------
>
> As of now the PostgreSQL community has provided patches and
> functionalities
> which allow full WAL-based replication as well as hot-standby. To
> extend this
> functionality and to make PostgreSQL even more suitable for "enterprise"
> computing than it is today, we have the commitment of a sponsor to
> fund partial
> replication for PostgreSQL 8.5 / 8.6.
>
> This is the first draft of a proposal to make partial WAL-based
> replication work
> and to provide an additional set of fancy features to the community
> which has
> been waiting for real in-core replication for a decade or more.
>
>
> Why partial replication?
> ------------------------
>
> In some cases people have master servers which contain enormous
> amounts of data
> (XX TB or so). If more than just one replica of this data is needed it
> might
> happen that different slaves are used for different purposes. This
> implies that
> not all data will be used by all machines.
> An example: Consider a server at a phone company collecting phone
> calls, billing
> data, and maybe network routing data. Data is used by different
> departments and
> one machine is not enough to serve all three departments. With the new
> functionality proposed here we could make 3 replicas each holding just
> a group
> of tables for specific tasks thus allowing people to buy cheaper
> hardware for
> slaves and use more machines instead.
>
>
> Current status:
> ---------------
>
> Hot-standby and streaming replication have been a huge leap
> forward for the
> community and what is proposed here will be an extension to those
> patches and
> functionalities. This concept is NOT aimed to replace anything - it is
> mainly an
> addon.
>
>
> Nodes and replication filters:
> ------------------------------
>
> As of 8.4 standby systems are done by creating an archive_command
> along with a
> base backup. Although it is easy to do some users still reported some
> difficulties due to a total misunderstanding of PITR.
>
> The idea is to add a functionality to add slaves like this:
>
> CREATE REPLICA node_name
> CONNECT FROM SLAVE 'connect_string'
> TRANSFER COMMAND 'command'
> [ USING replication_filter ];
>
> 'command' would be any shell script copying data from the local master
> to the
> new database node called node_name. Replication filters can be used to
> make X
> replicas contain the same tables. Filtersets can be created like this:
>
> CREATE REPLICATION FILTER filter_name
> [ EMPTY | FULL ] [ INCLUDE | EXCLUDE CHANGES ];
>
> Replication filters can be modified ...
>
> ALTER REPLICATION FILTER filter_name RENAME TO new_filtername;
> ALTER REPLICATION FILTER filter_name
> { ADD | REMOVE } { TABLE | INDEX | SEQUENCE } object;
>
> Filter sets can be dropped like this ...
>
> DROP REPLICATION FILTER filter_name;
>
> Internally CREATE REPLICA would initiate a base backup to the new
> slave server
> just like we would do it manually otherwise. The server would
> automatically use
> the user defined 'command' to copy one file after the other to the
> slave box.
> The idea is basically stolen from archive_command and friends. At this
> stage we
> either copy the entire instance as we would do it with a normal base
> backup or
> just what is needed (defined by the replication filter). Users would
> automatically only copy data to a slave which is really needed there
> and which
> matches their filter config. If the copy is done, we can register the
> new node
> inside a system table and commit the transaction. Also, we can
> automatically
> create a useful recovery.conf setup - we know how to connect from the
> slave to
> the master (we can use ' CONNECT FROM SLAVE [ USING ] ' to write a proper
> recovery.conf file).
>
> Tables can easily be added or removed from a replication filter with
> ALTER
> REPLICATION FILTER.
>
> Replicas can be removed easily:
>
> DROP REPLICA node_name;
>
> Why SQL to add a node? We are convinced that this is the most
> simplistic way of
> doing things. It is the most intuitive way of doing things. We
> believe it
> gives users a real feeling of simplicity. The current way of doing
> base backups
> should stay in place as it is - it has proven to be nice for countless
> tasks.
> However, it is not suitable for managing 10 or more replicas easily.
> Especially
> not when they are not full blown copies of the master.
>
>
> Technical ideas:
> ----------------
>
> System tables:
>
> We suggest always replicating the entire system catalog. It would
> be a total
> disaster to try some other implementation. The same applies for other
> tables - we
> always replicate entire tables; no WHERE-clauses allowed when it comes to
> replicating any table.
> How can a query on the slave figure out if a table is around? The
> slave just has to
> know "who it is". Then it can lookup easily from the replication
> filter it is
> using if a table is actually physically in place or not. If a table is
> not in
> place, we can easily error out.
>
>
> Remove a table from the slave:
>
> This is not too hard; the master receives the command to kill a table on
> the slave.
> We will send a request to remove all storage files related to a table
> and adjust
> the replication filter to make sure that the slave will not replay
> content of
> this table anymore.
>
>
> Add a table to a slave:
>
> This is slightly more tricky. We start collecting WAL for a table,
> stop shipping
> WAL, use the TRANSFER COMMAND to copy the files related to the table
> added and
> resume recovery / sending once the storage file is on the slave.
>
>
> Additional stuff:
>
> Of course there are many more consistency considerations here. We cannot
> replicate an index if the table is not present, etc.
>
>
> many thanks,
>
> hans
>
>

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics

----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
