
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date: 2012-06-15 20:03:38
Message-ID: CA+Tgmoby-5VO7uXnAmnz02JproSiZ38tg5gbp73QVbzFKtEaKg@mail.gmail.com

On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I don't plan to throw in loads of conflict resolution smarts. The aim is to get
> to the place where all the infrastructure is there so that a MM solution can
> be built by basically plugging in a conflict resolution mechanism. Maybe
> providing a very simple one.
> I think without in-core support it's really, really hard to build a sensible MM
> implementation. Which doesn't mean it has to live entirely in core.

Of course, several people have already done it, perhaps most notably Bucardo.

Anyway, it would be good to get opinions from more people here. I am
sure I am not the only person with an opinion on the appropriateness
of trying to build a multi-master replication solution in core or,
indeed, the only person with an opinion on any of these other issues.
It is not good for those other opinions to be saved for a later date.

> Hm. Yes, you could do that. But I have to say I don't really see a point.
> Maybe the fact that I do envision multimaster systems at some point is
> clouding my judgement though as it's far less easy in that case.

Why? I don't think that particularly changes anything.

> It also complicates the wal format as you now need to specify whether you
> transport a full or a primary-key only tuple...

Why? If the schemas are in sync, the target knows what the PK is
perfectly well. If not, you're probably in trouble anyway.

> I think though that we do not want to enforce that mode of operation for
> tightly coupled instances. For those I was thinking of using command triggers
> to synchronize the catalogs.
> One of the big screwups of the current replication solutions is exactly that
> you cannot sensibly do DDL, which is not a big problem if you have a huge
> system with loads of different databases and very knowledgeable people et al.
> but at the beginning it really sucks. I have no problem with making one of the
> nodes the "schema master" in that case.
> Also I would like to avoid the overhead of the proxy instance for use-cases
> where you really want one node replicated as fully as possible with the slight
> exception of being able to have summing tables, different indexes et al.

In my view, a logical replication solution is precisely one in which
the catalogs don't need to be in sync. If the catalogs have to be in
sync, it's not logical replication. ISTM that what you're talking
about is sort of a hybrid between physical replication (pages) and
logical replication (tuples) - you want to ship around raw binary
tuple data, but not entire pages. The problem with that is that it's
going to be tough to make robust. Users could easily end up with
answers that are total nonsense, or even crash the server.
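
To make the failure mode concrete, here is a toy standalone C sketch
(not PostgreSQL code; the column layout and offsets are invented for
illustration) of what happens when the receiver's idea of the tuple
layout drifts away from the catalogs the sender actually used:

/*
 * Toy illustration (not PostgreSQL code): raw tuple bytes are only
 * meaningful relative to a column layout derived from the catalogs.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /*
     * The sender lays out (id int4, balance int8) with the int8 aligned
     * to an 8-byte boundary, exactly as its catalogs describe.
     */
    unsigned char raw[16] = {0};
    int32_t     id = 42;
    int64_t     balance = 100000;

    memcpy(raw + 0, &id, sizeof id);
    memcpy(raw + 8, &balance, sizeof balance);

    /* A receiver whose catalogs match decodes the same offsets ... */
    int32_t     r_id;
    int64_t     r_balance;

    memcpy(&r_id, raw + 0, sizeof r_id);
    memcpy(&r_balance, raw + 8, sizeof r_balance);
    printf("in sync:     id=%d balance=%lld\n", r_id, (long long) r_balance);

    /*
     * ... but a receiver whose catalogs describe a different layout (say it
     * believes the int8 starts right after the int4, at offset 4) silently
     * produces nonsense.  A layout that runs past the end of the actual
     * tuple is how you get a crash instead of merely a wrong answer.
     */
    memcpy(&r_balance, raw + 4, sizeof r_balance);
    printf("out of sync: id=%d balance=%lld\n", r_id, (long long) r_balance);
    return 0;
}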

To step back and talk about DDL more generally, you've mentioned a few
times the idea of using a streaming-replication (SR) instance that has
been filtered down to just the system catalogs as a means of generating logical change
records. However, as things stand today, there's no reason to suppose
that replicating anything less than the entire cluster is sufficient.
For example, you can't translate enum labels to strings without access
to the pg_enum catalog, which would be there, because enums are
built-in types. But someone could supply a similar user-defined type
that uses a user-defined table to do those lookups, and now you've got
a problem. I think this is a contractual problem, not a technical
one. From the point of view of logical replication, it would be nice
if type output functions were basically guaranteed to look at nothing
but the datum they get passed as an argument, or at the very least
nothing other than the system catalogs, but there is no such
guarantee. And, without such a guarantee, I don't believe that we can
create a high-performance, robust, in-core replication solution.
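
To sketch the distinction in code (simplified standalone C, not the real
fmgr-level signatures; widget_status_output and its label table are
invented names):

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch, not the real fmgr interface.  A "well-behaved"
 * output function is a pure function of the datum it is handed ...
 */
static void
int4_output(int32_t value, char *out, size_t outlen)
{
    snprintf(out, outlen, "%d", value);
}

/*
 * ... whereas an enum-like user-defined type has to map the stored value
 * to a label through a side table.  For a built-in enum that table is
 * pg_enum; in the case I'm worried about it is an ordinary user table,
 * and the output function simply cannot run without (matching) access
 * to it.
 */
struct label_row
{
    int32_t     value;
    const char *label;
};

static void
widget_status_output(int32_t value, const struct label_row *side_table,
                     int nrows, char *out, size_t outlen)
{
    int         i;

    for (i = 0; i < nrows; i++)
    {
        if (side_table[i].value == value)
        {
            snprintf(out, outlen, "%s", side_table[i].label);
            return;
        }
    }
    snprintf(out, outlen, "???");   /* label table out of sync */
}

int main(void)
{
    const struct label_row widget_status_labels[] = {
        {1, "new"}, {2, "shipped"}, {3, "returned"}
    };
    char        buf[32];

    int4_output(42, buf, sizeof buf);   /* needs nothing but the datum */
    printf("%s\n", buf);
    widget_status_output(2, widget_status_labels, 3, buf, sizeof buf);
    printf("%s\n", buf);                /* needs the side table too */
    return 0;
}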

Now, the nice thing about being the people who make PostgreSQL happen
is we get to decide what the C code that people load into the server
is required to guarantee; we can change the rules. Before, types were
allowed to do X, but now they're not. Unfortunately, in this case, I
don't really find that an acceptable solution. First, it might break
code that has worked with PostgreSQL for many years; but worse, it
won't break it in any obvious way, but rather only if you're using
logical replication, which will doubtless cause people to attribute
the failure to logical replication rather than to their own code.
Even if they do understand that we imposed a rule-change from on high,
there's no really good workaround: an enum type is a good example of
something that you *can't* implement without a side-table. Second, it
flies in the face of our often-stated desire to make the server
extensible. Also, even given the existence of such a restriction, you
still need to run any output function that relies on catalogs with
catalog contents that match what existed at the time that WAL was
generated, and under the correct snapshot, which is not trivial.
These problems arise for other things we might need to do while
examining the WAL stream as well, but they're particularly acute for
any application that wants to run type-output functions to generate
something that can be sent to a server that doesn't necessarily have
matching catalog contents.

But it strikes me that these things, really, are only a problem for a
minority of data types. For text, or int4, or float8, or even
timestamptz, we don't need *any catalog contents at all* to
reconstruct the tuple data. Knowing the correct type alignment and
which C function to call is entirely sufficient. So maybe instead of
trying to cobble together a set of catalog contents that we can use
for decoding any tuple whatsoever, we should instead divide the world
into well-behaved types and poorly-behaved types. Well-behaved types
are those that can be interpreted without the catalogs, provided that
you know what type it is. Poorly-behaved types (records, enums) are
those where you can't. For well-behaved types, we only need a small
amount of additional information in WAL to identify which types we're
trying to decode (not the type OID, which might fail in the presence
of nasty catalog hacks, but something more universal, like a UUID that
means "this is text", or something that identifies the C entrypoint).
And then maybe we handle poorly-behaved types by pushing some of the
work into the foreground task that's generating the WAL: in the worst
case, the process logs a record before each insert/update/delete
containing the text representation of any values that are going to be
hard to decode. In some cases (e.g. records all of whose constituent
fields are well-behaved types) we could instead log enough additional
information about the type to permit blind decoding.
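
Roughly the split I have in mind, as a standalone sketch (every name
below is invented; this is not a concrete WAL-format proposal):

/*
 * Standalone sketch of the split described above.  Well-behaved columns
 * are tagged in WAL with a stable type identifier and decoded by a
 * fixed, catalog-free C entrypoint; poorly-behaved columns fall back to
 * a text representation logged by the foreground process just before
 * the insert/update/delete.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
    COL_WELL_BEHAVED,           /* decode straight from the binary datum */
    COL_PRELOGGED_TEXT          /* use text logged by the foreground task */
} ColumnKind;

typedef struct
{
    ColumnKind  kind;
    const char *type_tag;       /* stable identifier, e.g. "pg.int4"; not an OID */
    const void *datum;          /* binary datum, for well-behaved columns */
    const char *pretext;        /* pre-logged text, for poorly-behaved ones */
} WalColumn;

typedef void (*DecodeFunc) (const void *datum, char *out, size_t outlen);

static void
decode_int4(const void *datum, char *out, size_t outlen)
{
    int32_t     v;

    memcpy(&v, datum, sizeof v);
    snprintf(out, outlen, "%d", v);
}

static void
decode_text(const void *datum, char *out, size_t outlen)
{
    snprintf(out, outlen, "%s", (const char *) datum);
}

/* Dispatch table: stable tag -> catalog-free decode entrypoint. */
static const struct
{
    const char *tag;
    DecodeFunc  fn;
}           decoders[] = {
    {"pg.int4", decode_int4},
    {"pg.text", decode_text}
};

static void
decode_column(const WalColumn *col, char *out, size_t outlen)
{
    int         i;

    if (col->kind == COL_PRELOGGED_TEXT)
    {
        /* Worst case: the foreground task already did the hard part. */
        snprintf(out, outlen, "%s", col->pretext);
        return;
    }
    for (i = 0; i < (int) (sizeof decoders / sizeof decoders[0]); i++)
    {
        if (strcmp(decoders[i].tag, col->type_tag) == 0)
        {
            decoders[i].fn(col->datum, out, outlen);
            return;
        }
    }
    snprintf(out, outlen, "<no decoder for %s>", col->type_tag);
}

int main(void)
{
    int32_t     id = 42;
    WalColumn   cols[] = {
        {COL_WELL_BEHAVED, "pg.int4", &id, NULL},
        {COL_PRELOGGED_TEXT, NULL, NULL, "shipped"} /* e.g. an enum label */
    };
    char        buf[64];
    int         i;

    for (i = 0; i < 2; i++)
    {
        decode_column(&cols[i], buf, sizeof buf);
        printf("column %d = %s\n", i, buf);
    }
    return 0;
}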

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
