Re: Speedup twophase transactions

From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Speedup twophase transactions
Date: 2016-09-20 14:13:19
Message-ID: 335982D0-4DCC-47B5-9028-F391E0993F52@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> On 20 Sep 2016, at 09:40, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>

Thanks for digging into this.

> + StartupCLOG();
> + StartupSUBTRANS(checkPoint.oldestActiveXid);
> + RecoverPreparedFromFiles();
> [...]
> /*
> - * Startup commit log and subtrans only. MultiXact and commit
> - * timestamp have already been started up and other SLRUs are not
> - * maintained during recovery and need not be started yet.
> - */
> - StartupCLOG();
> - StartupSUBTRANS(oldestActiveXID);
> Something is definitely wrong in this patch if we begin to do that.

Putting that before actual WAL replay is just following historical order of events.
Prepared files are pieces of WAL that happened before checkpoint, so ISTM
there is no conceptual problem in restoring their state before replay.

Moreover I think that this approach is better then oldest one.
There is kind of circular dependency between StartupSUBTRANS() and restoring
old prepared transaction: StartupSUBTRANS requires oldestActiveXID, but you
can get it only after recovering 2pc files, but that recovery requires working SUBTRANS.

In old code that was resolved by two passes through 2pc files: first one finds
oldestActiveXmin but doesn’t restore their state, then subtrans was started, and
only after that RecoverPreparedTransactions() is called. And that logic was repeated
three times in xlog.c with help of following functions:
PrescanPreparedTransactions() — 3 calls
StandbyRecoverPreparedTransactions() — 2 calls
RecoverPreparedTransactions() — 1 call

Now, since we know that 2pc files are written only on checkpoint we can boil all
that cases down to one: get oldestActiveXID from checkpoint and restore it before
WAL replay. So only one call of RecoverPreparedFromFiles() and one call of
CleanupPreparedTransactionsOnPITR() in case of PITR. And each of them does
only what stated on function name without side-effects like advancing nextXid in
previous versions.

So, to summarise, i think keeping old interfaces here is bigger evil because in each of
mentioned functions we will need to deal with both memory and file 2pc states.

> There is no need to move those two calls normally, and I think that we
> had better continue to do that only for hot standbys just to improve
> 2PC handling…

We can simulate old interface, it’s not a big problem. But that will complicate 2pc code and
keep useless code in xlog.c because it was written in assumption that 2pc file can be created
before checkpoint and now it isn’t true.

> CleanupPreparedTransactionsOnPITR() cleans up pg_twophase, but isn't
> it necessary to do something as well for what is in memory?

Yes, that is necessary. Thanks, will fix. And probably add prepared tx to PITR test.

> I have been thinking about this patch quite a bit, and the approach
> taken looks really over-complicated to me. Basically what we are
> trying to do here is to reuse as much as possible code paths that are
> being used by non-recovery code paths during recovery, and then try to
> adapt a bunch of things like SUBTRANS structures, CLOG, XID handling
> and PGXACT things in sync to handle the 2PC information in memory
> correctly. I am getting to think that this is *really* bug-prone in
> the future and really going to make maintenance hard.

With this patch we are reusing one infrastructure for normal work, recovery and replay.
I don’t think that we will win a lot reliability if we split that to a different code paths.

> Taking one step back, and I know that I am a noisy one, but looking at
> this patch is making me aware of the fact that it is trying more and
> more to patch things around the more we dig into issues, and I'd
> suspect that trying to adapt for example sub-transaction and clog
> handling is just the tip of the iceberg and that there are more things
> that need to be taken care of if we continue to move on with this
> approach.

Again, it isn’t patching around to fix issues, it’s totally possible to keep old interface.
However it’s possible that current approach is wrong because of some aspect
that i didn’t think of, but now I don’t see good counterarguments.

> Coming to the point: why don't we simplify things? In short:

> - Just use a static list and allocate a chunk of shared memory as
> needed. DSM could come into play here to adapt to the size of a 2PC
> status file, this way there is no need to allocate a chunk of memory
> with an upper bound. Still we could use an hardcoded upper-bound of
> say 2k with max_prepared_transactions, and just write the 2PC status
> file to disk if its size is higher than what can stick into memory.
> - save 2PC information in memory instead of writing it to a 2PC file
> when XLOG_XACT_PREPARE shows up.
> - flush information to disk when there is a valid restart point in
> RecoveryRestartPoint(). We would need as well to tweak
> PrescanPreparedTransactions accordingly than everything we are trying
> to do here.
> That would be way more simple than what's proposed here, and we'd
> still keep all the performance benefits.

So file arrives to replica through WAL and instead of directly reading it you suggest
to copy it do DSM if it is small enough, and to filesystem if not. Probably that will
allow us to stay only around reading/writing files, but:

* That will be measurably slower than proposed approach because of unnecessary
copying between WAL and DSM. Not to mention prepare records of arbitrary length.
* Different behaviour on replica and master — less tested code for replica.
* Almost the same behaviour can be achieved by delaying fsync on 2pc files till
checkpoint.

But if reword you comment in a way that it is better to avoid replaying prepare record at all,
like it happens now in master, then I see the point. And that is possible even without DSM, it possible
to allocate static sized array storing some info about tx, whether it is in the WAL or in file, xid, gid.
Some sort of PGXACT doppelganger only for replay purposes instead of using normal one.

So taking into account my comments what do you think? Should we keep current approach
or try to avoid replaying prepare records at all?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-09-20 14:46:39 Re: [PATCH] Transaction traceability - txid_status(bigint)
Previous Message Robert Haas 2016-09-20 13:41:06 Re: Declarative partitioning - another take