Nested transactions: low level stuff

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Nested transactions: low level stuff
Date: 2003-03-19 07:28:09
Message-ID: 3u5g7v4qe59jj31vc78fbm120ak0flnm2l@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here is, what I've put together from various messages posted
November/December last year.

. Subtrans trees
. Transaction states
. Tuple visibility
. HeapTupleSatifiesUpdate
. Shortcuts
. Still missing
. Objections and suggestions

Subtrans trees
--------------

A subtrans tree holds information about sub-transaction to parent
transaction relationships. The nodes are transaction ids, edges
denote the fact that the child node represents a sub-transaction of
the transaction represented by the parent node. The root of a
subtrans tree represents a main-transaction.

A subtrans tree is navigable from child to parent. I.e. if you have
a sub-transaction xid you can find its parent efficiently, but there
is no fast and easy way to enumerate all children of a given parent
xid.

The subtrans dependencies have to be visible to all backends.

At least two possible implementations have been discussed.

Together with clog: Store a 30 bit parent transaction offset with the
two state bits. Offset 0 means the transaction is a main-transaction.

In a separate data structure: pg_subtrans is a huge (possibly
sparsely populated) array of parent xids, indexed by child xid. This
array is divided into manageable chunks, represented by files,
"pg_subtrans_NNNN". These files are only created when necessary. At
any time only a tiny part of the whole array is kept in shared
buffers. This concept is similar or almost equal to pg_clog, which is
an array of doublebits. Most of its implementation can be stolen from
the clog code.

We believe that parent xid lookups are far less frequent than
transaction state lookups and therefor favour the second
implementation.

Transaction states
------------------

Let the currently unused fourth state in
pg_clog indicate a committed subtransaction. In pg_clog there are two
bits per transaction, commit and abort, with the following meaning:

a c
0 0 transaction in progress, the owning backend knows whether it is
a main- or a sub-transaction, other backends don't care
1 0 aborted, nobody cares whether main- or sub-transaction
0 1 committed main-transaction or - with shortcut 2 - a sub-
transaction that's known committed to all active transactions
1 1 committed sub-transaction, have to look for parent in
pg_subtrans

Updating the state of a single (main or sub) transaction is atomic,
just like it is now. This solves the atomicity problem without
requiring long locks.

Here is what is to be done for some operations:

BEGIN main transaction:
. Get a new xid (no change to current behaviour).
. pg_clog[xid] is still 00, meaning active.
. pg_subtrans[xid] is still 0, meaning no parent.

BEGIN subtransaction:
. Push current transaction info onto local stack.
. Get a new xid.
. Record parent xid in pg_subtrans[xid].
. pg_clog[xid] is still 00.

ROLLBACK subtransaction:
. Set pg_clog[xid] to 10 (aborted).
. Optionally set clog bits for subsubtransactions to 10 (*).
. Pop transaction info from stack.

COMMIT subtransaction:
. Set pg_clog[xid] to 11 (committed subtrans).
. Don't touch clog bits for subsubtransactions!
. Pop transaction info from stack.

ROLLBACK main transaction:
. Set pg_clog[xid] to 10 (aborted).
. Optionally set clog bits for subtransactions to 10 (*).

COMMIT main transaction:
. Set pg_clog[xid] to 01 (committed).
. Don't touch clog bits for subtransactions!

(*) This can only be done if there is a transaction local data
structure holding information about all subtransactions or we have a
subtrans tree implementation that is navigable from parent to child.
Anyway, whether this is done or not does not influence the consistency
of this proposal.

Tuple visibility
----------------

Visibility check by other transactions: If a tuple is visited and its
XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to
be consulted to find out the status of the inserting/deleting
transaction xid. If pg_clog[xid] is ...

00: transaction still active

10: aborted

01: committed

11: committed subtransaction, have to check parent

Only in this last case do we have to get parentxid from pg_subtrans.
Now we look at pg_clog[parentxid]. If we find ...

00: parent still active, so xid is considered active, too

10: parent aborted, so xid is considered aborted

01: parent committed, so xid is considered committed

11: recursively check grandparent(s) ...

If a visibility check sees a transaction as committed, it checks
whether the main-transaction xid is visible in the current snapshot.
Fortunately we get the main-transaction xid as a by-product of
looking up the parentxid states.

If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to replace
a sub-transaction xid in xmin or xmax respectively with the
main-transaction xid at the same time. Otherwise we'd have to look
for the main xid, whenever a tuple is touched.

XXX "at the same time" may need a bit more thought regarding locking.

HeapTupleSatifiesUpdate
-----------------------

If an UPDATE is about to change a tuple touched by another active
transaction, it waits for the other transaction to commit or abort.
We must always wait for the main transaction, not the subtrans.

Shortcuts
---------

Under certain conditions the 1/1 state can be replaced with 0/1 or
1/0. This could save a lot of parent lookups.

Shortcut 1: As soon as a transaction is rolled back all its
subtransactions are considered aborted, too. So their pg_clog bits
can be set to 10 and there is no need to look up their parent any
more. This can be done
. immediately on ROLLBACK (if the implementation has a list of
committed subtransactions handy),
. as a side effect of a later visibility check,
. by VACUUM.

Shortcut 2: If a committed subtransaction has only committed
(grand)parents up to the committed main transaction and the main
transaction is visible to all running transactions, i.e. the main
transaction's xid is before RecentGlobalXmin (*), then the
subtransaction's pg_clog bits can be set to 01 and further parent xid
lookups are not needed any more. This can be done
. as a side effect of a later visibility check,
. by VACUUM.

(*) AFAICS the statements "main transaction's xid is before
RecentGlobalXmin" and "subtransaction's xid is before
RecentGlobalXmin" are equivalent.

The point here is: pg_subtrans is only queried, if we find 11 in
pg_clog; we try to replace 11 a soon as possible, but only if it is
safe to do so. Status 11 is short-lived, it is replaced with
the final status by one or more of

- ROLLBACK of the parent transaction
- a later visibility check (as a side effect)
- VACUUM

pg_subtrans cleanup: A pg_subtrans_NNNN file covers a known range of
transaction ids. As soon as none of these transactions has a pg_clog
status of 11, the pg_subtrans_NNNN file can be removed. VACUUM can do
this, and it won't even have to check the heap.

Still missing
-------------

. Visibility checks for tuples inserted/deleted by a (sub)transaction
belonging to the current transaction tree (have to check local
transaction stack whenever we look at a xid or switch to a parent xid)

. WAL

Objections and suggestions
--------------------------

Tom:
>Also, using a 11 state doubles the amount of pg_clog I/O needed to
>commit a collection of subtransactions. You have to write 11 as the
>state of each commitable subtransaction, then commit the parent (write
>01 as its state), then go back and change the state of each
>subtransaction to 01. (Whether this last bit is done as part of parent
>transaction commit, or during later inspections of the state of the
>subtransaction, doesn't change the argument.)

Yes, there may be more pg_clog I/O. But I don't think that it is
doubled. I'd expect some locality; after all one page holds the
status bits for 32K transactions.

Tom:
>I think it would be preferable to use only three states: active,
>aborted, committed. The parent commit protocol is (1) write 10 as state
>of each aborted subtransaction (this should be done as soon as the
>subtransaction is known aborted, rather than delaying to parent commit);
>(2) write 01 as state of parent (this is the atomic commit); (3) write
>01 as state of each committed subtransaction. Readers who see 00 must
>check the parent state; if the parent is committed then they have to go
>back and recheck the child state (to see if it became "aborted" after
>they looked). This halves the write traffic during a commit, at the
>cost of additional read traffic when subtransaction state is checked in
>a narrow window after the time of parent transaction commit. I believe
>it nets out to be faster.

Pro: Saves the fourth state for something different.

Con: Does a pg_subtrans lookup even for main transactions (at least
until xid < RecentGlobalXmin)

Con: Needs subtrans tree navigation from parent to child.

It's noticeably past midnight as I write this ... will reconsider
that approach after I had same sleep.

Tom:
>But wait a moment. The child xid is by definition always greater than
>(newer than) its parent. So if we consult pg_clog and find the
>transaction marked committed, *and* the xid is before the window of XIDs
>in our snapshot, then even if it's not a top-level xid, the parent must
>be before our window too. Therefore we can conclude the transaction is
>visible in our snapshot. So indeed there is a good-size range of xids
>for which we'll never need to chase the parent link: everything before
>the RecentGlobalXmin computed by GetSnapshotData.

I think this has been integrated into this proposal (cf. shortcut 2).

> (We do have to set
>subtransactions to committed during parent commit to make this true;
>we can't update them lazily. But I think that's okay.)

This is not clear to me ... I'll try again later ...

------------------------------

That's all for now, folks. Fell free to comment.

Sorry for the long post. Would you prefer such kind of stuff on a web
page and just a short note with the URL to the list?

Servus
Manfred

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jinqiang Han 2003-03-19 08:32:25 inquiery
Previous Message Ronald Kuczek 2003-03-19 06:30:14 Win32 native port