Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <peter(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, hlinnakangas(at)vmware(dot)com
Subject: Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel
Date: 2012-10-11 08:47:51
Message-ID: 507687B7.1050005@krosing.net
Lists: pgsql-hackers

On 10/11/2012 03:10 AM, Robert Haas wrote:
> On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
>> The purpose of ApplyCache/transaction reassembly is to reassemble
>> interlaced records, and organise them by XID, so that the consumer
>> client code sees only streams (well, lists) of records split by XID.
> I think I've mentioned it before, but in the interest of not being
> seen to critique the bikeshed only after it's been painted: this
> design gives up something very important that exists in our current
> built-in replication solution, namely pipelining.
The lack of pipelining (and the resulting complexity of apply cache
and spilling to disk) is something we have discussed with Andres, and
to my understanding it is not a final design decision but just a
stepping stone in how this quite large development is structured.

Pipelining (or parallel apply, as I described it) requires either a
large number of apply backends and code to manage them, or autonomous
transactions.

It could (arguably!) be easier to implement autonomous transactions
instead of apply cache, but Andres had valid reasons to start with apply
cache and move to parallel apply later.

As I understand it, parallel apply is definitely one of the things that
will be coming, and after that the performance characteristics (fast AND
smooth) will be very similar to current physical WAL streaming.
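
To make the trade-off concrete, here is a toy sketch of what transaction
reassembly amounts to. It is Python rather than anything from the patch,
and the record layout (xid, kind, payload) and the name reassemble() are
made up purely for illustration. Changes are buffered per XID and handed
to the consumer only when that XID's commit record shows up, which is
exactly why a big transaction sees no apply progress until it commits:

```python
from collections import defaultdict

def reassemble(records):
    """Group interlaced (xid, kind, payload) records by XID.

    Hypothetical record format, for illustration only. Yields
    (xid, [payloads]) when the commit record for that XID is seen;
    changes belonging to aborted transactions are dropped.
    """
    pending = defaultdict(list)  # xid -> changes buffered so far
    for xid, kind, payload in records:
        if kind == "change":
            pending[xid].append(payload)
        elif kind == "commit":
            yield xid, pending.pop(xid, [])
        elif kind == "abort":
            pending.pop(xid, None)

# Three transactions interlaced in the stream; 102 aborts.
wal = [
    (100, "change", "INSERT a"),
    (101, "change", "INSERT b"),
    (100, "change", "UPDATE a"),
    (101, "commit", None),
    (102, "change", "INSERT c"),
    (102, "abort", None),
    (100, "commit", None),
]
result = list(reassemble(wal))
```

Note that transactions come out in commit order (101 before 100), and
that `pending` is the part that would have to spill to disk for large
transactions. A pipelined or parallel apply would instead forward each
change as it arrives and decide visibility at commit time.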

> With streaming
> replication as it exists today, a transaction that modifies a huge
> amount of data (such as a bulk load) can be applied on the standby as
> it happens. The rows thus inserted will become visible only if and
> when the transaction commits on the master and the commit record is
> replayed on the standby. This has a number of important advantages,
> perhaps most importantly that the lag between commit and data
> visibility remains short. With the proposed system, we can't start
> applying the changes until the transaction has committed and the
> commit record has been replayed, so a big transaction is going to have
> a lot of apply latency.
>
> Now, I am not 100% opposed to a design that surrenders this property
> in exchange for other important benefits, but I think it would be
> worth thinking about whether there is any way that we can design this
> that either avoids giving that property up at all, or gives it up for
> the time being but allows us to potentially get back to it in a later
> version. Reassembling complete transactions is surely cool and some
> clients will want that, but being able to apply replicated
> transactions *without* reassembling them in their entirety is even
> cooler, and some clients will want that, too.
>
> If we're going to stick with a design that reassembles transactions, I
> think there are a number of issues that deserve careful thought.
> First, memory usage. I don't think it's acceptable for the decoding
> process to assume that it can allocate enough backend-private memory
> to store all of the in-flight changes (either as WAL or in some more
> decoded form). We have assiduously avoided such assumptions thus far;
> you can write a terabyte of data in one transaction with just a
> gigabyte of shared buffers if you so desire (and if you're patient).
> Here's you making the same point in different words:
>
>> Applycache is presumably where you're going to want to spill
>> transaction streams to disk, eventually. That seems like a
>> prerequisite to commit.
> Second, crash recovery. I think whatever we put in place here has to
> be able to survive a crash on any node. Decoding must be able to
> restart successfully after a system crash, and it has to be able to
> apply exactly the set of transactions that were committed but not
> applied prior to the crash. Maybe an appropriate mechanism for this
> already exists or has been discussed, but I haven't seen it go by;
> sorry if I have missed the boat.
>
>> You consider this to be a throw-away function that won't ever be
>> committed. However, I strongly feel that you should move it into
>> /contrib, so that it can serve as a sort of reference implementation
>> for authors of decoder client code, in the same spirit as numerous
>> existing contrib modules (think contrib/spi).
> Without prejudice to the rest of this review which looks quite
> well-considered, I'd like to add a particular +1 to this point.
>
