Re: Deriving Recovery Snapshots

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deriving Recovery Snapshots
Date: 2008-10-22 11:58:24
Message-ID: 1224676704.27145.249.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:

> How about:
>
> 1. Keep all transactions and subtransactions in UnobservedXids.
> 2. If it fills up, remove from it all subtransactions that the startup
> process knows to be subtransactions and whose parents it knows, and
> update subtrans. Mark the array as overflowed.
>
> To take a snapshot, a backend simply copies UnobservedXids array and the
> flag. If it hasn't overflowed, a transaction is considered to be in
> progress if it's in the array. If it has overflowed, and the xid is not
> in the array, check subtrans.

We can't check subtrans. We do not have any record of what the parent is
for an unobserved transaction id. So the complete list of unobserved
xids *must* be added to the snapshot. If that makes snapshot overflow,
we have a big problem: we would be forced to say "sorry snapshot cannot
be issued at this time, please wait". Ugh!
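To make the problem concrete, here is a minimal sketch in Python rather than backend C. All names (RecoveryState, take_snapshot, in_progress_via_subtrans) are hypothetical and model only the idea, not the actual data structures: a subtrans lookup can resolve a subxid only if its parent was recorded, so an unobserved xid with no recorded parent has to be in the snapshot itself.

```python
# Illustrative model only; every name here is hypothetical.
from dataclasses import dataclass, field


@dataclass
class RecoveryState:
    # xids known to be running whose details we have not yet seen
    unobserved_xids: set = field(default_factory=set)
    # child xid -> parent xid, present only when the parent was recorded
    subtrans: dict = field(default_factory=dict)


def take_snapshot(state, max_xids):
    """Copy every unobserved xid; refuse if they do not all fit."""
    if len(state.unobserved_xids) > max_xids:
        raise RuntimeError("snapshot cannot be issued at this time")
    return frozenset(state.unobserved_xids)


def in_progress_via_subtrans(state, snapshot, xid):
    """Fallback lookup: walk up to the top-level parent, then test it."""
    seen = set()
    while xid in state.subtrans and xid not in seen:
        seen.add(xid)
        xid = state.subtrans[xid]
    return xid in snapshot
```

In this model, if xid 101 is really a child of 100 but that link was never recorded, dropping 101 from the snapshot makes the fallback report it as not in progress, which is the wrong answer. Keeping every unobserved xid in the snapshot avoids that, at the price of the overflow case above.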

> Note that the startup process sees all WAL records, so it can do
> arbitrarily complex bookkeeping in backend-private memory, and only
> expose the necessary parts in shared mem. For example, it can keep track
> of the parent-child relationships of the xids in UnobservedXids, but the
> backends taking snapshots don't need to know about that. For step 2 to
> work, that's exactly what the startup process needs to keep track of.

> For the startup process to know about the parent-child relationships,
> we'll need something like WAL changes you suggested. I'm not too
> thrilled about adding a new field to all WAL records. Seems simpler to
> just rely on the new WAL records on AssignTransactionId(), and we can
> only do it, say, every 100 subtransactions, if we make the
> UnobservedXids array big enough (100*max_connections).

Yes, we can make the UnobservedXids array bigger, but only to the point
where it will all fit within a snapshot.

The proposed WAL changes use space that was previously wasted, so there
is no increase in the amount of data going to disk. Deriving that data
is very quick when those fields are unused, and that logic is executed
before we take WALInsertLock. So overall, very low overhead.

Every new subxid needs to specify its parent's xid. We must supply that
information somehow: either via an XLOG_XACT_ASSIGNMENT record or, as I
have done in most cases, by tucking it into the wasted space on the xlrec.
Writing a WAL record every 100 subtransactions will not work: we need to
write to subtrans *before* that xid appears anywhere on disk, so that
visibility tests can determine the status of the transaction.
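The ordering requirement can be sketched as follows. This is a toy model in Python, not backend code; replay, unresolved, and the (xid, parent) record shape are assumptions made up for illustration.

```python
# Hypothetical model of WAL replay; names and record shape are invented.
def replay(records, subtrans, on_disk):
    """Apply (xid, parent_xid_or_None) records in WAL order.

    The parent link is stored before the xid shows up in data, so a
    reader never meets a subxid it cannot resolve.
    """
    for xid, parent in records:
        if parent is not None:
            subtrans.setdefault(xid, parent)  # first: record the parent
        on_disk.add(xid)                      # then: xid appears in data


def unresolved(subtrans, on_disk, subxids):
    """Subxids present in data whose parent is still unknown."""
    return {x for x in subxids if x in on_disk and x not in subtrans}
```

With the parent carried on each record, unresolved() stays empty at every point during replay. If the parent links arrive only in a later batch record, every subxid written before that record is unresolved in the meantime, and a visibility test that hits one of them has nothing to go on.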

The approach I have come up with is very finely balanced. It's the
*only* approach that I've come up with that covers all requirements;
there were very few technical choices to make. If it weren't for
subtransactions, transactions that disappear because of FATAL errors,
and unobserved xids, it would be much simpler. Having said that, the
code isn't excessively complex; I wrote it in about 3 days.

> This isn't actually that different from your proposal. The big
> difference is that instead of PROC entries and UnobservedXids, all
> transactions are tracked in UnobservedXids, and instead of caching
> subtransactions in the subxids array in PROC entries, they're cached in
> UnobservedXids as well.

> Another, completely different approach would be to forget about xid
> arrays altogether, and change the way snapshots are taken: just do a
> full memcpy of the clog between xmin and xmax. That might be pretty slow
> if xmax-xmin is big, though.
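For a rough sense of scale on the clog-copy idea: clog stores two status bits per transaction, so four xids fit in a byte. The helper below is a hypothetical back-of-envelope calculation in Python that ignores xid wraparound and page boundaries.

```python
CLOG_BITS_PER_XACT = 2  # clog keeps two status bits per transaction


def clog_copy_bytes(xmin, xmax):
    """Bytes of clog covering xids in [xmin, xmax), in whole bytes.

    Assumes xmin <= xmax with no wraparound (a simplification).
    """
    if xmax <= xmin:
        return 0
    xacts_per_byte = 8 // CLOG_BITS_PER_XACT
    first_byte = xmin // xacts_per_byte
    last_byte = (xmax - 1) // xacts_per_byte
    return last_byte - first_byte + 1
```

Under these assumptions a window of one million xids is about 250,000 bytes copied per snapshot, which is where the "pretty slow if xmax-xmin is big" concern comes from.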

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
