Re: Speedup twophase transactions

From: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Speedup twophase transactions
Date: 2017-01-23 12:00:59
Message-ID: CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=VjoMnRnW5s+fKST3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Stas,

>
> So, here is brand new implementation of the same thing.
>
> Now instead of creating pgproc entry for prepared transaction during recovery,
> I just store recptr/xid correspondence in separate 2L-list and
> deleting entries in that
> list if redo process faced commit/abort. In case of checkpoint or end
> of recovery
> transactions remaining in that list are dumped to files in pg_twophase.
>
> Seems that current approach is way more simpler and patch has two times less
> LOCs then previous one.
>

The proposed approach and patch does appear to be much simpler than
the previous one.

I have some comments/questions on the patch (twophase_recovery_list_2.diff):

1)
+ /*

+ * Move prepared transactions from KnownPreparedList to files, if any.

+ * It is possible to skip that step and teach subsequent code about

+ * KnownPreparedList, but whole PrescanPreparedTransactions() happens

+ * once during end of recovery or promote, so probably it isn't worth

+ * complications.

+ */

+ KnownPreparedRecreateFiles(InvalidXLogRecPtr);

+

Speeding up recovery or failover activity via a faster promote is a
desirable thing. So, maybe, we should look at teaching the relevant
code about using "KnownPreparedList"? I know that would increase the
size of this patch and would mean more testing, but this seems to be
last remaining optimization in this code path.

2)
+ /*

+ * Here we know that file should be moved to disk. But
aborting recovery because

+ * of absence of unnecessary file doesn't seems to be a good
idea, so call remove

+ * with giveWarning=false.

+ */

+ RemoveTwoPhaseFile(xid, false);

We are going to call the above in case of COMMIT/ABORT. If so, we
should always find the "xid" entry either in the KnownPreparedList or
as a file. Does it make sense to call the above function with "false"
then?

3) We are pushing the fsyncing of 2PC files to the checkpoint replay
activity with this patch. Now, typically, we would assume that PREPARE
followed by COMMIT/ABORT would happen within a checkpoint replay
cycle, if not, this would make checkpoint replays slower since the
earlier spread out fsyncs are now getting bunched up. I had concerns
about this but clearly your latest performance measurements don't show
any negatives around this directly.

4) Minor nit-pick on existing code.

(errmsg_plural("%u two-phase state file was written "
"for
long-running prepared transactions",
"%u
two-phase state files were written "
"for
long-running prepared transactions",
serialized_xacts,
serialized_xacts)

Shouldn’t the singular part of the message above be:
"%u two-phase state file was written for a long-running prepared transaction"

But, then, English is not my native language, so I might be off here :-)

5) Why isn't KnownPreparedRecreateFiles() tracking which entries from
KnownPreparedList have been written to disk in an earlier iteration
like in the normal code path? Why is it removing the entry from
KnownPreparedList and not just marking it as saved on disk?

6) Do we need to hold TwoPhaseStateLock lock in shared mode while
calling KnownPreparedRecreateFiles()? I think, it does not matter in
the recovery/replay code path.

7) I fixed some typos and comments. PFA, patch attached.

Other than this, I ran TAP tests and they succeed as needed.

Regards,
Nikhils

On 23 January 2017 at 16:56, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>
>> On 27 Dec 2016, at 07:31, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>> I think that it would be a good idea to actually test that in pure
>> recovery time, aka no client, and just use a base backup and make it
>> recover X prepared transactions that have created Y checkpoints after
>> dropping cache (or restarting server).
>
> I did tests with following setup:
>
> * Start postgres initialised with pgbench
> * Start pg_receivexlog
> * Take basebackup
> * Perform 1.5 M transactions
> * Stop everything and apply wal files stored by pg_receivexlog to base backup.
>
> All tests performed on a laptop with nvme ssd
> number of transactions: 1.5M
> start segment: 0x4
>
> -master non-2pc:
> last segment: 0x1b
> average recovery time per 16 wal files: 11.8s
> average total recovery time: 17.0s
>
> -master 2pc:
> last segment: 0x44
> average recovery time per 16 wal files: 142s
> average total recovery time: 568s
>
> -patched 2pc (previous patch):
> last segment: 0x44
> average recovery time per 16 wal files: 5.3s
> average total recovery time: 21.2s
>
> -patched2 2pc (dlist_push_tail changed to dlist_push_head):
> last segment: 0x44
> average recovery time per 16 wal files: 5.2s
> average total recovery time: 20.8s
>
> So skipping unnecessary fsyncs gave us x25 speed increase even on ssd (on hdd difference should be bigger).
> Pushing to list's head didn’t yield measurable results, but anyway seems to be conceptually better.
>
> PS:
> I’ve faced situation when pg_basebackup freezes until checkpoint happens (automatic or user-issued).
> Is that expected behaviour?
>
> --
> Stas Kelvich
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Attachment Content-Type Size
twophase_recovery_list_2_23017.patch application/octet-stream 20.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-01-23 12:18:37 Re: Assignment of valid collation for SET operations on queries with UNKNOWN types.
Previous Message Ivan Kartyshov 2017-01-23 11:56:36 Re: make async slave to wait for lsn to be replayed