Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
Date: 2012-06-21 08:37:12
Message-ID: 201206211037.13230.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi Steve,

On Thursday, June 21, 2012 02:16:57 AM Steve Singer wrote:
> On 12-06-13 07:28 AM, Andres Freund wrote:
> > From: Andres Freund<andres(at)anarazel(dot)de>
> >
> > The individual changes need to be identified by an xid. The xid can be a
> > subtransaction or a toplevel one; at commit those can be reintegrated by
> > doing a k-way mergesort between the individual transactions.
> >
> > Callbacks for apply_begin, apply_change and apply_commit are provided to
> > retrieve complete transactions.
> >
> > Missing:
> > - spill-to-disk
> > - correct subtransaction merge, current behaviour is simple/wrong
> > - DDL handling (?)
> > - resource usage controls
>
> Here is an initial review of the ApplyCache patch.
Thanks!

> This patch provides a module for taking actions in the WAL stream,
> grouping the actions by transaction, and then passing these change
> records to a set of plugin functions.
>
> For each transaction it encounters it keeps a list of the actions in
> that transaction. The ilist included in an earlier patch is used;
> changes resulting from that patch review would affect the code here but
> not in a way that changes the design. When the module sees a commit for
> a transaction it calls the apply_change callback for each change.
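
To make that flow concrete, here is a rough sketch of what such a
callback interface could look like. The callback names follow the ones
in the commit message; all the types and the replay function are made up
for illustration and are not the patch's actual definitions.

#include <stdint.h>
#include <stddef.h>

typedef uint32_t TxnId;
typedef uint64_t LSN;

/* illustrative only; not the patch's real change representation */
typedef struct Change
{
    LSN             lsn;        /* position of this change in the WAL */
    void           *tuple_data; /* raw HeapTuple data, decoded at commit */
    struct Change  *next;       /* per-transaction singly linked list */
} Change;

typedef struct ApplyCallbacks
{
    void (*apply_begin)(TxnId xid);
    void (*apply_change)(TxnId xid, Change *change);
    void (*apply_commit)(TxnId xid, LSN commit_lsn);
} ApplyCallbacks;

/* On commit, replay one transaction's buffered changes in order. */
static void
replay_transaction(const ApplyCallbacks *cb, TxnId xid,
                   Change *changes, LSN commit_lsn)
{
    cb->apply_begin(xid);
    for (Change *c = changes; c != NULL; c = c->next)
        cb->apply_change(xid, c);
    cb->apply_commit(xid, commit_lsn);
}
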
>
> I can think of three ways that a replication system like this could try
> to apply transactions.
>
> 1) Each time it sees a new transaction it could open up a new
> transaction on the replica and make that change. It leaves the
> transaction open and goes on applying the next change (which might be
> for the current transaction or might be for another one).
> When it comes across a commit record it would then commit the
> transaction. If 100 concurrent transactions were open on the origin
> then 100 concurrent transactions would be open on the replica.
>
> 2) Determine the commit order of the transactions, group all the changes
> for a particular transaction together, and apply them in that order:
> the transaction that committed first is applied and committed, then we
> move on to the transaction that committed second.
>
> 3) Group the transactions in a way that you move the replica from one
> consistent snapshot to another. This is what Slony and Londiste do
> because they don't have the commit order or commit timestamps. Built-in
> replication can do better.
>
> This patch implements option (2). If we had a way of implementing
> option (1) efficiently, would we be better off?
> Option (2) requires us to put unparsed WAL data (HeapTuples) in the
> apply cache. You can't translate this to an independent LCR until you
> call the apply_change callback (which happens once the commit is
> encountered). The reason is that some of the changes might be DDL (or
> things generated by a DDL trigger) that will change the translation
> catalog, so you can't translate the heap data to LCRs until you're at
> a stage where you can update the translation catalog. In both cases
> you might need to see later WAL records before you can convert an
> earlier one into an LCR (i.e. TOAST).
Very good analysis, thanks!

Another reason why we cannot easily do 1) is that subtransactions aren't
discernible from top-level transactions before the top-level commit happens;
we can only properly merge in the right order (by "sorting" via LSN) once we
have seen the commit record, which includes a list of all committed
subtransactions.
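
To illustrate, that merge is essentially a k-way merge of already
LSN-ordered per-(sub)transaction change lists; a simplified sketch,
ignoring spilling, memory management and error handling (the struct and
function names are invented for the example):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t LSN;

typedef struct Change
{
    LSN            lsn;
    struct Change *next;
} Change;

/*
 * Merge the change lists of a toplevel transaction and its committed
 * subtransactions into a single LSN-ordered list.  Each input list is
 * already in LSN order, so repeatedly taking the smallest head works.
 */
static Change *
merge_by_lsn(Change **lists, int nlists)
{
    Change  dummy = { 0, NULL };
    Change *tail = &dummy;

    for (;;)
    {
        int best = -1;

        /* find the list whose head has the smallest LSN */
        for (int i = 0; i < nlists; i++)
        {
            if (lists[i] != NULL &&
                (best == -1 || lists[i]->lsn < lists[best]->lsn))
                best = i;
        }

        if (best == -1)
            break;              /* all lists exhausted */

        tail->next = lists[best];
        tail = tail->next;
        lists[best] = lists[best]->next;
    }

    tail->next = NULL;
    return dummy.next;
}
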

I also don't think 1) would be particularly welcomed by people trying to
replicate into foreign systems.

> Some of my concerns with the apply cache are
>
> Big transactions (bulk loads, mass updates) will be cached in the apply
> cache until the commit comes along. One issue Slony has to do with bulk
> operations is that the replicas can't start processing the bulk INSERT
> until after it has committed. If it takes 10 hours to load the data on
> the master it will take another 10 hours (at best) to load the data into
> the replica (20 hours after you start the process). With binary
> streaming replication your replica is done processing the bulk update
> shortly after the master is.
One reason we have the low-level apply stuff planned is that, done that way,
apply on the standby actually takes less time than on the master; if you do
it right there is significantly lower overhead.
Still, a 10h bulk load will definitely cause problems. I don't think there is
a way around that for now.

> Long-running transactions can sit in the cache for a long time. When
> you spill to disk we would want the long-running but inactive ones
> spilled to disk first. This is solvable but adds to the complexity of
> this module; how were you planning on managing which items of the list
> get spilled to disk?
I planned to have some cutoff 'max_changes_in_memory_per_txn' value. If it has
been reached for one transaction, all of that transaction's existing changes
are spilled to disk; new changes can again be kept in memory until the cutoff
is reached again.
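
Roughly along these lines; only the max_changes_in_memory_per_txn cutoff
comes from the description above, everything else (types, file handling)
is invented for illustration:

#include <stdio.h>
#include <stdlib.h>

#define MAX_CHANGES_IN_MEMORY_PER_TXN 4096

typedef struct Change
{
    size_t         size;
    char          *data;
    struct Change *next;
} Change;

typedef struct TxnBuffer
{
    Change *head;
    Change *tail;
    int     nentries;   /* changes currently held in memory */
    FILE   *spill_file; /* opened lazily on first spill */
} TxnBuffer;

/*
 * Append one change; once the per-transaction cutoff is hit, write the
 * buffered changes out and start accumulating in memory again.
 */
static void
txn_add_change(TxnBuffer *txn, Change *change)
{
    change->next = NULL;
    if (txn->tail)
        txn->tail->next = change;
    else
        txn->head = change;
    txn->tail = change;
    txn->nentries++;

    if (txn->nentries >= MAX_CHANGES_IN_MEMORY_PER_TXN)
    {
        if (txn->spill_file == NULL)
            txn->spill_file = tmpfile();

        for (Change *c = txn->head; c != NULL;)
        {
            Change *next = c->next;

            /* length-prefixed dump, so the change can be re-read later */
            fwrite(&c->size, sizeof(c->size), 1, txn->spill_file);
            fwrite(c->data, 1, c->size, txn->spill_file);
            free(c->data);
            free(c);
            c = next;
        }
        txn->head = txn->tail = NULL;
        txn->nentries = 0;
    }
}
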

We also need to support serializing the cache for crash recovery and for
shutdown of the receiving side. Depending on how we do the WAL decoding we
will need that more frequently...

> The idea that we can safely reorder the commands into transactional
> groupings works (as far as I know) today because DDL commands get big
> heavy locks that are held until the end of the transaction. I think
> Robert mentioned earlier in the parent thread that maybe some of that
> will be changed one day.
I think we shouldn't worry about that overly much today.

> The downsides of (1) that I see are:
>
> We would want a single backend to keep open multiple transactions at
> once. How hard would that be to implement? Would subtransactions be good
> enough here?
Subtransactions wouldn't be good enough (they cannot ever be concurrent
anyway). Implementing multiple concurrent top-level transactions is a major
project on its own imo.

> Applying (or even translating WAL to LCRs) the changes in parallel
> across transactions might complicate the catalog structure, because each
> concurrent transaction might need its own version of the catalog (or can
> you depend on the locking at the master for this? I think you can today).
The locking should be enough, yes.

> I think I need more convincing that approach (2), what this patch
> implements, is the best way of doing things compared to (1). I will hold
> off on a more detailed review of the code until I get a better sense of
> whether the design will change.
I don't think 1) is really possible without much, much more work:
* multiple autonomous transactions in one backend are a major feature on
their own
* correct decoding of tuples likely requires some form of reassembling txns
* TOAST reassembly requires some minimal transaction reassembly
* subtransaction handling requires full transaction reassembly

I have some ideas about how we can parallelize some cases of apply in the
future, but imo that's an incremental improvement on top of this.

Thanks for the review!

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
