[RFC][PATCH] Logical Replication/BDR prototype and architecture

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date: 2012-06-13 11:27:23
Message-ID: 201206131327.24092.andres@2ndquadrant.com

Hi everyone,

This mail contains the high-level design description of how our prototype of
in-core logical replication works. The individual patches will be posted as
replies to this email. I obviously welcome all sorts of productive comments on
both the individual patches and the architecture.

Unless somebody objects I will add most of the individual patches marked as RFC
to the current commitfest. I hope that with the comments stemming from that
round we can get several of the patches into the first or second commitfest. As
soon as the design is clear/accepted we will try very hard to get the following
patches into the second or third round.

If anybody disagrees with the procedural way we are trying to do this, please
raise a hand now.

I tried to find the right balance between keeping the description short enough
that anybody will read it and detailed enough that it is understandable. I can
go into much more detail on any part if wanted.

Please keep in mind that these patches are *RFC* and a prototype, not intended
to be applied as-is. There is a short description of the individual patches and
their relevance at the end of this email.

Greetings,

Andres

========

=== Design goals for logical replication ===
- in core
- fast
- async
- robust
- multi-master
- modular
- as unintrusive as possible, implementation-wise
- basis for other technologies (sharding, replication into other DBMSs, ...)

For reasons why we think this is an important set of features please check out
the presentation from the in-core replication summit at pgcon:
http://wiki.postgresql.org/wiki/File:BDR_Presentation_PGCon2012.pdf

While you may argue that most of the above design goals are already provided by
various trigger-based replication solutions like Londiste or Slony, we think
that that is not enough for several reasons:

- not in core (and thus less trustworthy)
- duplication of writes due to an additional log
- performance in general (check the end of the above presentation)
- complex to use because there is no native administration interface

We want to emphasize that this proposed architecture is based on the experience
of developing a minimal prototype with the above goals in mind. While we
obviously hope that a good part of it is reusable for the community, we
definitely do *not* expect the community to accept this as-is. It is intended
to be the basis upon which we, the community, can build and design the future
logical replication.

=== Basic architecture ===
Very broadly speaking there are several major pieces common to most approaches
to replication:
1. Source data generation
2. Transportation of that data
3. Applying the changes
4. Conflict resolution

1.:

We need a change stream that contains all required changes in the correct
order. Such a stream has to be synchronized across concurrent backends, which
raises obvious concurrency and scalability issues. Reusing the WAL stream for
this seems a good choice since it is needed anyway and already addresses those
issues, and it further means that we don't duplicate the writes and locking
already performed for its maintenance. Any other stream-generating component
would introduce additional scalability issues.

Unfortunately, in this case, the WAL is mostly a physical representation of the
changes and thus does not, by itself, contain the necessary information in a
convenient format to create logical changesets.

The biggest problem is that interpreting tuples in the WAL stream requires an
up-to-date system catalog and needs to be done in a compatible backend and
architecture. The requirement of an up-to-date catalog could be solved by
adding more data to the WAL stream, but it seems likely that that would require
relatively intrusive & complex changes. Instead we chose to require a
synchronized catalog at the decoding site. That adds some complexity to use
cases like replicating into a different database or cross-version
replication. For those it is relatively straightforward to develop a proxy pg
instance that only contains the catalog and does the transformation to textual
changes.

This also is the solution to the other big problem, the need to work around
architecture/version specific binary formats. The alternative, producing
cross-version, cross-architecture compatible binary changes, or even more so
textual changes, all the time seems to be prohibitively expensive, both from a
CPU and a storage POV and also in terms of implementation effort.

The catalog on the site where changes originate can *not* be used for the
decoding because at the time we decode the WAL the catalog may have changed
from the state it was in when the WAL was generated. A possible solution for
this would be to have a fully versioned catalog but that again seems to be
rather complex and intrusive.

For some operations (UPDATE, DELETE) and corner cases (e.g. full-page writes)
additional data needs to be logged, but the additional amount of data isn't
that big. Requiring a primary key for any change other than INSERT seems to be
a sensible restriction for now. The required changes are fully contained in
heapam.c and are pretty simple so far.
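
To illustrate what "additional data" means here, a hypothetical sketch follows;
the struct and field names below are made up for this mail and are not the ones
used in the patch:

#include "postgres.h"

/*
 * Hypothetical illustration only -- not the actual patch.  The idea: for
 * UPDATE and DELETE, heapam.c additionally attaches the old tuple's primary
 * key columns to the WAL record, so the decoder can emit the logical
 * equivalent of "DELETE FROM tbl WHERE pkey = ..." without having to look
 * at the (possibly already pruned or overwritten) old heap page.
 */
typedef struct xl_heap_logical_key
{
    Oid         indexoid;       /* primary key index the columns belong to */
    uint16      natts;          /* number of key columns that follow */
    /* the key column values follow in the table's binary format */
} xl_heap_logical_key;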

2.:

For the transport of the non-decoded data from the originating site to the
decoding site we decided to reuse the infrastructure already provided by
walsender/walreceiver. We introduced a new command, START_LOGICAL_REPLICATION,
analogous to START_REPLICATION, which streams out all xlog records that pass
through a filter.

The on-the-wire format stays the same. The filter currently simply filters out
all records which are not interesting for logical replication (indexes,
freezing, ...) and records that did not originate on the same system.
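
From a client's perspective, starting such a stream could look roughly like the
following libpq sketch (illustrative only; the connection parameters and the
exact argument syntax of the prototype's command are assumptions):

#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
    /* replication=true requests a walsender connection, as for physical rep */
    PGconn     *conn = PQconnectdb("host=origin user=rep replication=true");
    PGresult   *res;
    char       *buf;
    int         len;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* start streaming filtered WAL; the start-LSN syntax is an assumption */
    res = PQexec(conn, "START_LOGICAL_REPLICATION 0/0");
    if (PQresultStatus(res) != PGRES_COPY_BOTH)
    {
        fprintf(stderr, "could not start streaming: %s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(res);

    /* consume the stream; the payload is plain (filtered) WAL record data */
    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        /* hand the data to the decoding/apply machinery here */
        PQfreemem(buf);
    }

    PQfinish(conn);
    return 0;
}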

The requirement of filtering by the 'origin' of a WAL record comes from the
planned multimaster support. Changes replayed locally that originate from
another site should not be replayed again there; if the WAL were used plainly,
without such a filter, that would cause loops. Instead we tag every WAL record
with the "node id" of the site that caused the change to happen, and changes
carrying a node's own "node id" won't get applied again.

Currently, filtered records simply get replaced by NOOP records and loads of
zeroes, which obviously is not a sensible solution. The difficulty with
actually removing the records is that doing so would change the LSNs, which we
currently rely on.

The filtering might very well get expanded to support partial replication and
the like in the future.

3.:

To sensibly apply changes out of the WAL stream we need to solve two things:
reassemble transactions and apply them to the target database.

The logical stream from 1. via 2. consists of individual changes identified by
the relfilenode of the table and the xid of the transaction. Given
(sub)transactions, rollbacks, crash recovery and the like, those changes
obviously cannot be applied individually without completely losing any
pretense of consistency. To solve that we introduced a module, dubbed
ApplyCache, which does the reassembling. This module is *independent* of the
data source and of the method of applying changes, so it can be reused for
replicating into a foreign system or similar.
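
Conceptually the reassembly can be pictured roughly like this (an illustrative
sketch only; the actual ApplyCache data structures and functions look
different):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/hsearch.h"

/*
 * Illustrative sketch, not the ApplyCache API.  Decoded changes are
 * collected per top-level transaction; once the commit record is seen they
 * are handed to the apply callbacks in order, on abort they are discarded.
 */
typedef struct TxnChanges
{
    TransactionId xid;          /* top-level transaction, hash key */
    List       *changes;        /* decoded changes in WAL order */
} TxnChanges;

static HTAB *txn_hash;          /* maps xid -> TxnChanges */

static void
add_change(TransactionId xid, void *change)
{
    bool        found;
    TxnChanges *txn;

    txn = (TxnChanges *) hash_search(txn_hash, &xid, HASH_ENTER, &found);
    if (!found)
        txn->changes = NIL;
    txn->changes = lappend(txn->changes, change);
}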

Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
we benchmarked!) we decided against statement generation for apply. Even when
using prepared statements the overhead is rather noticeable.

Instead we decided to use relatively low-level heapam.h/genam.h accesses to do
the apply. For now we decided to use only one process to do the applying;
parallelizing that seems too complex for the introduction of an already
complex feature.
In our tests the apply process could keep up with pgbench -c/-j 20+ generating
changes. This will obviously depend heavily on the workload; a fully seek-bound
workload will definitely not scale that well.

Just to reiterate: plugging in another method to do the apply should be a
relatively simple matter of pointing three callbacks (begin, apply_change,
commit) at different functions.
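
As a sketch, those hook points could look about like this (names and
signatures are illustrative assumptions, not the prototype's actual API):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Illustrative only.  An apply backend -- or a plugin replicating into a
 * foreign system -- fills in these three callbacks, and the reassembly
 * machinery invokes them once a transaction has been fully reassembled.
 */
typedef struct ApplyCallbacks
{
    void        (*begin) (TransactionId xid);
    void        (*apply_change) (TransactionId xid, Oid relfilenode,
                                 void *change);
    void        (*commit) (TransactionId xid, XLogRecPtr commit_lsn);
} ApplyCallbacks;

The shipped apply module would point these at functions using heapam.h/genam.h
directly; something generating textual statements for a foreign DBMS could plug
in entirely different ones.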

Another complexity in this is how to synchronize the catalogs. We plan to use
command/event triggers and the oid preserving features from pg_upgrade to keep
the catalogs in-sync. We did not start working on that.

4.:

While we started to think about conflict resolution/avoidance we did not start
to work on it. We currently *cannot* handle conflicts. We think that the base
features/architecture should be agreed upon before starting on it.

Multimaster tests were done with sequences set up with INCREMENT 2 and
different start values on the two nodes.

=== Current Prototype ===

The current prototype consists of a series of patches that are split into
hopefully sensible and coherent parts to make reviewing of individual parts
possible.

It's also available in the 'cabal-rebasing' branch on
git.postgresql.org/users/andresfreund/postgres.git . Note that that branch
will get its history rewritten, though.

01: wakeup handling: reduces replication lag, not very interesting in this context

02: Add zeroRecPtr: not very interesting either

03: new syscache for relfilenode. This would benefit from some syscache-experienced eyes

04: embedded lists: This is a general facility, general review appreciated

05: preliminary bgworker support: This is not ready and is just posted as it is
preliminary work for the other patches. Simon will post a real patch soon

06: XLogReader: Review definitely appreciated

07: logical data additions for WAL: Review definitely appreciated, I do not expect fundamental changes

08: ApplyCache: Important infrastructure for the patch, review definitely appreciated

09: Wal Decoding: Decode WAL generated with wal_level=logical into an ApplyCache

10: WAL with 'origin node': This is another important base-piece for logical rep

11: WAL segment handling changes: If the basic idea of adding a node_id to the
functions and adding a pg_lcr directory is acceptable the rest of the patch is
fairly boring/mechanical

12: walsender/walreceiver changes: Implement transport/filtering of logical
changes. Very relevant

13: shared memory/crash recovery state handling for logical rep: Very relevant
minus the TODO's in the commit message

14: apply module: review appreciated

15: apply process: somewhat dependent on the preliminary changes in 05; the
general direction is visible, but loads of detail work is needed as soon as
some design decisions are agreed upon.

16: this document. Not very interesting after you've read it ;)

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
