Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject:	Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
Date:	2012-10-11 11:42:29
Message-ID:	201210111342.29404.andres@2ndquadrant.com
Lists:	pgsql-hackers

On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
> On 22.09.2012 20:00, Andres Freund wrote:
> > [[basic-schema]]
> > .Architecture Schema
> > [ditaa diagram; its layout did not survive quoting. Boxes shown:
> >  Traditional Stuff: Backend (x3), Autovac, ..., WAL writer, Catalog, WAL,
> >    Startup/Recovery, SR/Hot Standby, Point in Time
> >  New Stuff, inside Walsender: WAL decoding, TX reassembly, Output Plugin,
> >    each linked back to Catalog
> >  Running separately: Logical Rep., Multimaster, Slony, Auditing,
> >    Mysql/..., Custom Solutions, Debugging, Data Recovery]
>
>
> This diagram triggers a pet-peeve of mine: What do all the boxes and
> lines mean? An architecture diagram should always include a key. I find
> that when I am drawing a diagram myself, adding the key clarifies my own
> thinking too.
Hm. Ok.

> This looks like a data-flow diagram, where the arrows indicate the data
> flows between components, and the boxes seem to represent processes. But
> in that case, I think the arrows pointing from the plugins in walsender
> to Catalog are backwards. The catalog information flows from the Catalog
> to walsender, walsender does not write to the catalogs.
The reason I drew it that way is that it actively needs to go back to the
catalog and query it, which is somewhat different from the rest, which
basically could be seen as a unidirectional pipeline.

> Zooming out to look at the big picture, I think the elephant in the room
> with this whole effort is how it fares against trigger-based
> replication. You list a number of disadvantages that trigger-based
> solutions have, compared to the proposed logical replication. Let's take
> a closer look at them:

> > * essentially duplicates the amount of writes (or even more!)
> True.
By now I think it's essentially unfixable.

> > * synchronous replication hard or impossible to implement
> I don't see any explanation it could be implemented in the proposed
> logical replication either.
It's basically the same as it is for synchronous streaming repl. At the place
where SyncRepWaitForLSN() is done you instead/also wait for the decoding to
reach that LSN (it's in the WAL, so everything is decodable) and for the other
side to have confirmed reception of those changes. I think this should be
doable with only minor code modifications.
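
To make the wait condition concrete, here is a toy model in Python (not
PostgreSQL code). The class, the field names and the helper are all
hypothetical; only the idea of waiting for both decoding and remote
confirmation to reach the commit LSN comes from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class LogicalSyncState:
    """Toy stand-in for per-walsender progress; all names are hypothetical."""
    decoded_lsn: int = 0    # how far WAL decoding has progressed
    confirmed_lsn: int = 0  # how far the remote side has confirmed reception


def commit_may_return(state: LogicalSyncState, commit_lsn: int) -> bool:
    """True once a backend waiting on commit_lsn could stop waiting."""
    return state.decoded_lsn >= commit_lsn and state.confirmed_lsn >= commit_lsn


state = LogicalSyncState()
commit_lsn = 1000

# Nothing decoded or confirmed yet: the committing backend keeps waiting.
waiting_initially = not commit_may_return(state, commit_lsn)

# Decoding caught up, but the remote side has not confirmed: still waiting.
state.decoded_lsn = 1000
waiting_after_decode = not commit_may_return(state, commit_lsn)

# Remote confirmation arrives: the wait can end.
state.confirmed_lsn = 1000
released = commit_may_return(state, commit_lsn)
```

In the real system the waiting would sit at the point where
SyncRepWaitForLSN() is called today; the sketch only shows the condition, not
the latching or signalling.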

The existing support for all that is basically the reason we want to reuse the
walsender framework. (will start a thread about that soon)

> > * noticeable CPU overhead
> >
> > * trigger functions
> > * text conversion of data
>
> Well, I'm pretty sure we could find some micro-optimizations for these
> if we put in the effort.
Any improvements there are a good idea independent of this proposal, but I
don't see how we can fundamentally improve on the status quo.

> And the proposed code isn't exactly free, either.
If you don't have frequent DDL it's really not all that expensive. In the
version without DDL support I didn't manage to saturate the ApplyCache with
either parallel COPY in individual transactions (repeated 100MB files) or with
pgbench.
Also, it's basically doing work that the trigger/queue based solutions have to
do as well, just that they do it via far less optimized SQL statements.

DDL support doesn't really change much, as the overhead for transactions
without DDL and without concurrently running DDL should be fairly minor (the
submitted version is *not* finalized there, it builds a new snapshot instead of
copying/referencing the old one).

> > * complex parts implemented in several solutions
> Not sure what this means, but the proposed code is quite complex too.
It is, agreed.

What I mean is that significantly complex logic is buried in the encoding,
queuing and decoding/ordering logic of every trigger-based replication
solution. That's not a good thing.

> > * not in core
>
> IMHO that's a good thing, and I'd hope this new logical replication to
> live outside core as well, as much as possible.
I don't agree there, but I would like to keep that a separate discussion.

For now I/we only want to submit the changes that technically need in-core
support to work sensibly (this, background workers, some walsender
integration). The goal of working nearly completely without special in-core
support held the existing solutions back quite a bit imo.

> But whether or not something is in core is just a political decision, not a
> reason to implement something new.
Isn't it both? There are things you simply cannot do unless you're inside core.

Politically I think the external status of all those logical replication
projects grew to be an adoption barrier. I don't even want to think about how
many bad home-grown logical replication solutions I have seen out there that
implement everything from the get-go.

> If the only meaningful advantage is reducing the amount of WAL written,
> I can't help thinking that we should just try to address that in the
> existing solutions, even if it seems "easy to solve at a first glance,
> but a solution not using a normal transactional table for its log/queue
> has to solve a lot of problems", as the document says.
You're welcome to make suggestions, but everything I could think of that didn't
fall short of reality ended up basically duplicating the amount of writes &
fsyncs, even if not going through the WAL.

You need to be crash safe/restartable (=> writes, fsyncs) and you need to
reduce the writes (in memory, => !writes). There is only one authoritative
point where you can rely on a commit to have been successful, and that's when
the commit record has been written to the WAL. You can't send out the data to
be committed before that's written, because that could result in spuriously
committed transactions on the remote side, and you can't easily do it
afterwards because you can crash after the commit.
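
As a rough illustration of why decoding from the WAL sidesteps that dilemma,
here is a toy Python model. The record tuples and the function name are made
up for this sketch and are not the patch's data structures. Changes are kept
in memory per transaction and only handed out once that transaction's commit
record is seen.

```python
from collections import defaultdict


def emit_committed(wal_records):
    """Yield (xid, changes) only for transactions whose commit record is in the WAL.

    wal_records is a hypothetical stream of (kind, xid, payload) tuples.
    """
    pending = defaultdict(list)  # in-memory only, no extra writes or fsyncs
    for kind, xid, payload in wal_records:
        if kind == "change":
            pending[xid].append(payload)
        elif kind == "commit":
            # The commit record is the authoritative point: safe to send now.
            yield xid, pending.pop(xid, [])
        elif kind == "abort":
            # Never sent, so nothing to undo on the remote side.
            pending.pop(xid, None)


wal = [
    ("change", 1, "INSERT a"),
    ("change", 2, "INSERT b"),
    ("commit", 2, None),
    ("change", 1, "UPDATE a"),
    ("abort", 1, None),
    ("change", 3, "INSERT c"),  # still in progress: no commit record yet
]

sent = list(emit_committed(wal))
```

Since the input is the already-durable WAL, rerunning the same function after
a crash reproduces the same output, which is the restartability the in-memory
buffering would otherwise lack.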

> Sorry to be a naysayer, but I'm pretty scared of all the new code and
> complexity these patches bring into core.
Understandable. I tried to keep the introduction of complexity in existing code
paths relatively minor and I think I mostly succeeded there but it still needs
to be maintained.

> PS. I'd love to see a basic Slony plugin for this, for example, to see
> how much extra code on top of the posted patches you need to write in a
> plugin like that to make it functional. I'm worried that it's a lot..
I think before it's possible to do something like that a few more design
decisions need to be made. Mostly the walsender(ish) integration needs to be
done.

After that I can imagine writing a demo plugin that outputs changes in a
Slony-compatible format, but I would like to see some Slony/Londiste person
cooperating on receiving/applying those.

What complications are you imagining?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
