Nested transaction proposal - take N (N > 2)

From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Nested transaction proposal - take N (N > 2)
Date: 2004-03-23 18:59:07
Message-ID: 20040323185907.GH3863@dcc.uchile.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hackers,

I'm (again) trying to get my hands dirty on the nested transaction
thingie. Hopefully I will have a lot more commitment this time ...

I'm currently looking at whatever is missing in the description below.
If you see anything that isn't being considered please let me know.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"

Request For Comments -- Nested Transactions

Topics:
- UNDO? Multiple XIDs?
- Tuple Visibility
- Memory Management
- Lock management
- File deletions, Deferred triggers, Asynchronous notifies

UNDO
----

There was a proposal for implementing the UNDO functionality to allow an
implementation of an overwriting storage manager, mainly by Vadim
Mikheev. This would quasi-automatically give the ability to support
subtransactions and savepoints. Not much people seemed to like the
idea, and everybody agreed that there was no one willing to implement
it. Also, the UNDO idea had its own problems and we don't seem to be
moving in the overwriting storage manager anyway. So UNDO is discarded
from the beggining.

The other way to implement subtransactions seems to be using multiple
transaction IDs, assigning a new one to each subxact. This has its own
issues, which I discuss below.

Vadim seemed to think that the transaction mechanism should be rethought
completely instead of "hacked". If someone has better ideas on how this
could be done, I'm all ears.

Tuple Visibility
----------------

The main idea is that a transaction tree will have several XIDs; thus,
some tuples written by it will have the XID of the main transaction in
xmin/xmax, but other tuples can have different XIDs. The important
point is that the final commit has to be atomic, that is, when the main
transaction commits, all tuples created by it will be seen as committed.

There would be a new storage area, called pg_subtrans, where the XID of
the parent transaction of each subtransaction would be stored
permanently, in a way similar to that of pg_clog.

The last time this issue was discussed, Manfred Koizar proposed the
following mechanism:

Let the currently unused fourth state in pg_clog indicate a
committed subtransaction. In pg_clog there are two bits per
transaction, commit and abort, with the following meaning:

a c
0 0 transaction in progress, the owning backend knows whether
it is a main- or a sub-transaction, other backends don't care
1 0 aborted, nobody cares whether main- or sub-transaction
0 1 committed main-transaction or - with shortcut 2 - a sub-
transaction that's known committed to all active transactions
1 1 committed sub-transaction, have to look for parent in
pg_subtrans

http://archives.postgresql.org/pgsql-hackers/2003-03/msg00754.php

The other, older proposal (mentioned in the same article), was to use
the 00 status indicate both a transaction in progress and a
subtransaction whose parent we don't know whether it committed or rolled
back (i.e. have to check). The main problem I see with this is that
every 00 status will have to be rechecked in pg_subtrans and pg_clog,
creating a lot of read traffic.

Note that the four-status mechanism increments the pg_clog write traffic
instead (a committed subtrans will have the 11 status written first,
then the 01 later). I think this is the lesser of the two evils,
because:

- there is more locality (I/O increment only in pg_clog, not pg_subtrans)
- read and write both use exclusive lock on page (no difference between
them, none is "cheaper")
- after write, pg_clog buffer is marked dirty but not physically written
(unless taken out of the buffer pool)

So I'm opting for Manfred's proposal.

One comment I didn't understand from that proposal was

If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to
replace a sub-transaction xid in xmin or xmax respectively with
the main-transaction xid at the same time. Otherwise we'd have
to look for the main xid, whenever a tuple is touched.

I suppose this is related to checking whether the xmax/xmin XID is in
the snapshot used by the tester. I'm not sure how to attack this, since
we cannot touch both things "at the same time". One idea would be to
put all subtransactions' XIDs in all snapshots, and just don't update
the xmin/xmax in the tuple. I'm not sure if this is workable.
Explanations and other ideas are welcome.

Memory Management
-----------------

We would have two new global memory contexts:
TopTransactionTreeContext
This one would live for the duration of the whole transaction tree
CommitContext
This one would actually be several contexts, one for each
subtransaction. If the subtrans rolls back, the context is deleted,
but it is kept if the subtrans commits. This facilitates the handling
of memory for things that have to be delayed until main transaction
commit.

The current TopTransactionContext would stay the same as now.

Lock Management
---------------

When a subtransaction aborts, it has to release all LWLocks and
heavyweight locks it's currently holding. On subtrans commit, all locks
are held and reassigned to the parent transaction. Only at main
transaction commit are all locks released.

Note that it's OK to abort a subtrans with LWLocks held, and they should
be released; however, it's not OK to commit a subtrans with LWLocks
held. This should be considered a programming error.

Regarding regular locks, there are two ways of doing this: one is using
the current TransactionId field, releasing those locks which have the
current transaction ID. However this has it's own problems, because
there are a couple of call sites that use the Xid for its own purposes
(apparently it's only LockRelationForSession() -- the other one with a
extraneous call seems to be XactLockTableInsert()/XactLockTableWait()
but apparently they are OK).

The other way would be keeping a List of all the Locks the current
subtrans is holding and release them on abort. This could be difficult
because the mechanism allows to upgrade a held lock, so aborting a
subtrans that upgraded some lock should "downgrade" it to the previous
level, but this sounds dangerous to me. Maybe we should consider
aborting the first subtrans involved in the lock (could be very tricky
to implement).

Regarding LWLocks, the first mechanism is not available so we will have
to stick to keeping a list within each subtrans.

Note regarding deadlock detection: currently, the routine checks for
deadlocks and if found aborts the current transaction. The new
mechanism has to abort the innermost transaction, and then recheck for a
deadlock condition again in a loop.

File deletion, Deferred triggers, Asynchronous notifies
--------------------------------------------------------

These mechanisms will have to keep lists of items corresponding to each
subtransaction, and do something special to it on commit or on abort.
Deftriggers and notifies can be dropped on transaction abort, which is
very easy if we allocate them in the CommitContext. On subtrans commit,
the list of them is reassigned to the parent subtrans.

File deletion is different because we can drop some files on
subtransaction abort. At commit, items are reassigned to parent.

The one other thing to look for is on-commit actions. They require
special handling but are not much different from the above.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2004-03-23 19:04:15 Re: linked list rewrite
Previous Message Alvaro Herrera 2004-03-23 18:28:43 Re: Two-phase commit