From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
Cc: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Speedup twophase transactions
Date: 2017-01-24 06:42:57
Message-ID: CAB7nPqQ=U0HY_V77LMCyCnEFQtPqn2WdDB2q2RDfsNyq8tGa2Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke
<nikhils(at)2ndquadrant(dot)com> wrote:
> Hi Stas,
>
>>
>> So, here is brand new implementation of the same thing.
>>
>> Now, instead of creating a PGPROC entry for each prepared transaction
>> during recovery, I just store the recptr/xid correspondence in a
>> separate doubly-linked list and delete entries from that list when the
>> redo process encounters a commit/abort. At checkpoint or at the end of
>> recovery, the transactions remaining in that list are dumped to files
>> in pg_twophase.
>>
>> The current approach seems much simpler, and the patch has half the
>> LOC of the previous one.
>>
>
> The proposed approach and patch do appear to be much simpler than
> the previous one.
>
> I have some comments/questions on the patch (twophase_recovery_list_2.diff):
>
> 1)
> + /*
> + * Move prepared transactions from KnownPreparedList to files, if any.
> + * It is possible to skip that step and teach subsequent code about
> + * KnownPreparedList, but whole PrescanPreparedTransactions() happens
> + * once during end of recovery or promote, so probably it isn't worth
> + * complications.
> + */
> + KnownPreparedRecreateFiles(InvalidXLogRecPtr);
> +
>
> Speeding up recovery or failover activity via a faster promote is a
> desirable thing. So, maybe, we should look at teaching the relevant
> code about using "KnownPreparedList"? I know that would increase the
> size of this patch and would mean more testing, but this seems to be
> the last remaining optimization in this code path.

That's a good idea, and worth having in this patch. Actually, we may
not want to call KnownPreparedRecreateFiles() here at all, as promotion
has not been synonymous with an end-of-recovery checkpoint for a couple
of releases now.
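
To make the discussion concrete, here is a minimal sketch (not the
patch code itself; the struct layout and helper names are assumptions
based on the names used in this thread) of how such a KnownPreparedList
could be maintained during redo with the backend's dlist API:

    #include "postgres.h"

    #include "access/xlogdefs.h"
    #include "lib/ilist.h"

    typedef struct KnownPreparedXact
    {
        TransactionId   xid;                /* XID of the prepared transaction */
        XLogRecPtr      prepare_start_lsn;  /* start LSN of its PREPARE record */
        dlist_node      list_node;
    } KnownPreparedXact;

    static dlist_head KnownPreparedList = DLIST_STATIC_INIT(KnownPreparedList);

    /* Called while replaying an XLOG_XACT_PREPARE record. */
    static void
    KnownPreparedAdd(TransactionId xid, XLogRecPtr prepare_start_lsn)
    {
        KnownPreparedXact *xact = palloc(sizeof(KnownPreparedXact));

        xact->xid = xid;
        xact->prepare_start_lsn = prepare_start_lsn;

        /*
         * Push to the head: the most recently prepared transactions are
         * the ones most likely to be committed or aborted next.
         */
        dlist_push_head(&KnownPreparedList, &xact->list_node);
    }

    /* Called while replaying COMMIT PREPARED or ROLLBACK PREPARED. */
    static void
    KnownPreparedRemove(TransactionId xid)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &KnownPreparedList)
        {
            KnownPreparedXact *xact =
                dlist_container(KnownPreparedXact, list_node, iter.cur);

            if (xact->xid == xid)
            {
                dlist_delete(iter.cur);
                pfree(xact);
                break;
            }
        }
    }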

> 3) We are pushing the fsyncing of 2PC files to checkpoint replay
> activity with this patch. Now, typically, we would assume that a
> PREPARE followed by COMMIT/ABORT happens within a single checkpoint
> replay cycle; if not, this would make checkpoint replay slower, since
> the earlier spread-out fsyncs now get bunched up. I had concerns
> about this, but clearly your latest performance measurements don't
> show any negatives around this directly.

Most of the 2PC syncs just won't happen: such transactions normally
don't last long, so the number of state files left to write and sync at
a checkpoint is far lower than the number of 2PC transactions processed
between two checkpoints. (When I was working on Postgres-XC, the 2PC
numbers were far less well-behaved.)
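
Reusing the KnownPreparedXact/KnownPreparedList definitions from the
sketch above, the checkpoint-time flush could then look roughly like
the following. RecreateTwoPhaseFile() is the existing twophase.c
routine that writes and fsyncs a state file, while
ReadTwoPhaseRecordFromWAL() is a hypothetical helper standing in for
however the patch fetches the PREPARE record back from WAL:

    /*
     * Hypothetical helper: read the 2PC state data of the PREPARE record
     * located at "lsn" back from WAL into a palloc'd buffer.
     */
    extern void ReadTwoPhaseRecordFromWAL(XLogRecPtr lsn, char **buf, int *len);

    /*
     * Dump to pg_twophase whatever is still in KnownPreparedList.  With
     * InvalidXLogRecPtr everything is flushed (end of recovery); with a
     * checkpoint's redo pointer, only transactions prepared before that
     * horizon need a file, since younger ones can be rebuilt from WAL.
     */
    static void
    KnownPreparedRecreateFiles(XLogRecPtr redo_horizon)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &KnownPreparedList)
        {
            KnownPreparedXact *xact =
                dlist_container(KnownPreparedXact, list_node, iter.cur);

            if (XLogRecPtrIsInvalid(redo_horizon) ||
                xact->prepare_start_lsn < redo_horizon)
            {
                char   *buf;
                int     len;

                ReadTwoPhaseRecordFromWAL(xact->prepare_start_lsn, &buf, &len);
                RecreateTwoPhaseFile(xact->xid, buf, len);  /* writes + fsyncs */
                pfree(buf);

                dlist_delete(iter.cur);
                pfree(xact);
            }
        }
    }

Only the transactions that survive a checkpoint interval pay the
write-and-fsync cost, which is why most of those syncs just disappear.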

> 6) Do we need to hold TwoPhaseStateLock lock in shared mode while
> calling KnownPreparedRecreateFiles()? I think, it does not matter in
> the recovery/replay code path.

It does not matter: the only code path that would care is the
checkpointer in TwoPhaseCheckpoint(), and this 2PC recovery patch no
longer touches TwoPhaseStateData. Actually, the patch makes no use of
it at all.

> 7) I fixed some typos and comments. PFA, patch attached.
>
> Other than this, I ran the TAP tests and they pass as expected.
>
> On 23 January 2017 at 16:56, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>> I did tests with following setup:
>>
>> * Start postgres initialised with pgbench
>> * Start pg_receivexlog
>> * Take basebackup
>> * Perform 1.5 M transactions
>> * Stop everything and apply wal files stored by pg_receivexlog to base backup.
>>
>> All tests performed on a laptop with nvme ssd
>> number of transactions: 1.5M
>> start segment: 0x4
>>
>> -master non-2pc:
>> last segment: 0x1b
>> average recovery time per 16 wal files: 11.8s
>> average total recovery time: 17.0s
>>
>> -master 2pc:
>> last segment: 0x44
>> average recovery time per 16 wal files: 142s
>> average total recovery time: 568s
>>
>> -patched 2pc (previous patch):
>> last segment: 0x44
>> average recovery time per 16 wal files: 5.3s
>> average total recovery time: 21.2s
>>
>> -patched2 2pc (dlist_push_tail changed to dlist_push_head):
>> last segment: 0x44
>> average recovery time per 16 wal files: 5.2s
>> average total recovery time: 20.8s

The difference between those two is likely noise.

By the way, in those measurements the OS cache is still filled with
the past WAL segments, which is rather a best case, no? What happens
if you run the same kind of test on a box where memory is busy doing
something else, and replayed WAL segments get evicted from the OS
cache more aggressively once the startup process switches to a new
segment? This could be tested, for example, on a VM with little memory
(say 386MB or less), so that when the startup process needs to access
past WAL segments again to recover 2PC information, it has to fetch
them directly from disk... One trick you could use here would be to
tweak the startup process so that it drops the OS cache once a segment
has finished replaying, and then observe the effects of such
aggressive OS cache eviction. This patch shows really nice
improvements with the OS cache backing the data, but it would still
make sense to test a worse case and see whether things could be done
better. Today the startup process only reads records sequentially,
never randomly; random access is a concept this patch introduces.
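
If someone wants to try that, here is a rough sketch of what such
test-only instrumentation could look like. The function name is made
up, hooking it into the place where the startup process closes a
replayed segment (e.g. XLogFileClose()) is my assumption, and the
USE_POSIX_FADVISE guard mirrors what xlog.c already uses for similar
kernel hints:

    #include "postgres.h"

    #include <fcntl.h>

    #include "storage/fd.h"

    /*
     * Test-only tweak: once the startup process has finished replaying a
     * WAL segment, ask the kernel to evict it from the OS cache, so that
     * any later lookup of a PREPARE record in that segment has to go
     * back to disk.
     */
    static void
    EvictReplayedSegmentFromCache(int fd)
    {
    #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
        /* Flush any dirty pages first so that DONTNEED can drop them. */
        if (pg_fsync(fd) == 0)
            (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    #endif
    }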

Anyway, perhaps this does not matter much: the non-recovery code path
does the same thing as this patch, and the improvement is too large to
ignore. So, for consistency's sake, we could go with the proposed
approach, which has the advantage of not putting any restriction on
the size of a 2PC file, unlike an implementation that keeps the
contents of the 2PC files in memory.

>> So skipping unnecessary fsyncs gave us a 25x speed increase even on SSD (on an HDD the difference should be bigger).
>> Pushing to the list's head didn't yield measurable results, but it still seems conceptually better.

That's consistent with more recent segments getting priority as the
startup process reads records sequentially, so it is definitely
better.
--
Michael
