Re: Speedup twophase transactions

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speedup twophase transactions
Date: 2016-03-11 16:41:03
Message-ID: 56E2F51F.2030005@redhat.com
Lists: pgsql-hackers

On 01/26/2016 07:43 AM, Stas Kelvich wrote:
> Thanks for reviews and commit!
>
> As Simon and Andres already mentioned in this thread, replay of two-phase transactions is significantly slower than the same operations in normal mode. The major reason is that each state file is fsynced during replay; while that is not a problem for recovery, it is a problem for replication. Under a high 2pc update load, the lag between master and async replica increases constantly (see graph below).
>
> One way to improve things is to move the fsyncs to restartpoints, but as we saw previously that is a half-measure, and frequent calls to fopen alone can cause a bottleneck.
>
> The other option is to use the same scenario for replay that is already used in non-recovery mode: read state files to memory during replay of prepare, and if a checkpoint/restartpoint occurs between prepare and commit, move the data to files. On commit we can read xlog or files. So here is a patch that implements this scenario for replay.
>
> The patch is quite straightforward. During replay of prepare records, RecoverPreparedFromXLOG() is called to create memory state in GXACT, PROC, PGPROC; on commit, XlogRedoFinishPrepared() is called to clean up that state. There are also several functions (PrescanPreparedTransactions, StandbyTransactionIdIsPrepared) that assumed that during replay all prepared xacts have files in pg_twophase, so I have extended them to check GXACT too.
> A side effect of that behaviour is that we can see prepared xacts in the pg_prepared_xacts view on the slave.
>
> Since this patch touches a quite sensitive part of postgres replay and there are some rarely used code paths, I wrote a shell script to set up master/slave replication and test different failure scenarios that can happen with instances. I am attaching this file to show the scenarios that I have tested and, more importantly, to show what I didn't test. In particular, I failed to reproduce a situation where StandbyTransactionIdIsPrepared() is called; maybe somebody can suggest a way to force its use. Also, I'm not too sure about the necessity of calling cache invalidation callbacks during XlogRedoFinishPrepared(); I've marked this place in the patch with a 2REVIEWER comment.
>
> Tests show that this patch increases the speed of 2pc replay to the level where the replica can keep pace with the master.
>
> Graph: replica lag under a pgbench run for 200 seconds with 2pc update transactions (80 connections, one update per 2pc tx, two servers with 12 cores each, 10GbE interconnect) on current master and with the suggested patch. Replica lag was measured each second with "select sent_location-replay_location as delay from pg_stat_replication;".
>
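
For readers following the thread, the replay scheme described in the quoted message can be illustrated with a toy model. This is only a sketch under stated assumptions: it is Python rather than the C of the patch, every class, method and variable name below is hypothetical, and it simply counts how many state-file fsyncs each approach would perform. Unlike the real code, it drops the in-memory entry once the state has been written out.

```python
# Toy model only: all names here are hypothetical, not PostgreSQL APIs.
class ReplayModel:
    def __init__(self, in_memory):
        self.in_memory = in_memory  # True: keep state in memory; False: file per PREPARE
        self.memory = {}            # xid -> state held in memory
        self.files = {}             # xid -> state written to a pg_twophase-like file
        self.fsyncs = 0             # number of state-file fsyncs performed

    def _write_file(self, xid, state):
        self.files[xid] = state
        self.fsyncs += 1

    def replay_prepare(self, xid, state):
        if self.in_memory:
            self.memory[xid] = state
        else:
            self._write_file(xid, state)

    def restartpoint(self):
        # Only transactions still prepared at this moment reach disk.
        for xid, state in self.memory.items():
            self._write_file(xid, state)
        self.memory.clear()

    def is_prepared(self, xid):
        # Lookups have to consult both places, not only the files.
        return xid in self.memory or xid in self.files

    def replay_commit(self, xid):
        # Take the state from memory if present, otherwise from the file.
        if xid in self.memory:
            return self.memory.pop(xid)
        return self.files.pop(xid)


def run(in_memory, n=100, restartpoint_at=50):
    """Replay n prepare/commit pairs with one restartpoint in between."""
    model = ReplayModel(in_memory)
    for xid in range(1, n + 1):
        model.replay_prepare(xid, b"state")
        if xid == restartpoint_at:
            model.restartpoint()  # this xid is prepared but not yet committed
        model.replay_commit(xid)
    return model.fsyncs
```

In this model, run(False) fsyncs once per prepared transaction, while run(True) only fsyncs for the single transaction that was still prepared when the restartpoint happened. The is_prepared() method mirrors the point that lookups can no longer assume every prepared transaction has a file.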

Some comments:

* The patch needs a rebase against the latest TwoPhaseFileHeader change
* Rework the check.sh script into a TAP test case (src/test/recovery),
as suggested by Alvaro and Michael down thread
* Add documentation for RecoverPreparedFromXLOG

+ * that xlog record. We need just to clen up memmory state.

'clean' + 'memory'

+ * This is usually called after end-of-recovery checkpoint, so all 2pc
+ * files moved xlog to files. But if we restart slave when master is
+ * switched off this function will be called before checkpoint ans we need
+ * to check PGXACT array as it can contain prepared transactions that
+ * didn't created any state files yet.

=>

"We need to check the PGXACT array for prepared transactions that
don't have any state file in case of a slave restart with the master
being off."

+ * prepare xlog resords in shared memory in the same way as it happens

'records'

+ * We need such behaviour because speed of 2PC replay on replica should
+ * be at least not slower than 2PC tx speed on master.

=>

"We need this behaviour because the speed of the 2PC replay on the
replica should be at least the same as the 2PC transaction speed of the
master."

I'll leave the 2REVIEWER section to Simon.

Best regards,
Jesper
