We've expended a lot of worry and discussion in the past about what
happens if the OID generator wraps around. However, there is another
4-byte counter in the system: the transaction ID (XID) generator.
While OID wraparound is survivable, if XIDs wrap around then we really
do have a Ragnarok scenario. The tuple validity checks do ordered
comparisons on XIDs, and will consider tuples with xmin > current xact
to be invalid. Result: after wraparound, your whole database would
instantly vanish from view.

The first thought that comes to mind is that XIDs should be promoted to
eight bytes. However, there are several practical problems with this:

* portability --- I don't believe long long int exists on all the
  platforms we support.
* performance --- except on true 64-bit platforms, widening Datum to
  eight bytes would be a system-wide performance hit, which is a tad
  unpleasant to fix a scenario that's not yet been reported from the
  field.
* disk space --- letting pg_log grow without bound isn't a pleasant
  prospect either.

I believe it is possible to fix these problems without widening XID,
by redefining XIDs in a way that allows for wraparound. Here's my
plan:

1. Allow XIDs to range from 0 to WRAPLIMIT-1 (WRAPLIMIT is not
necessarily 4G, see discussion below). Ordered comparisons on XIDs
are no longer simply "x < y", but need to be expressed as a macro.
We consider x < y if x != y and (y - x) % WRAPLIMIT < WRAPLIMIT/2.
This comparison will work as long as the range of interesting XIDs
never exceeds WRAPLIMIT/2. Essentially, we envision the actual value
of XID as being the low-order bits of a logical XID that always
increases, and we assume that no extant XID is more than WRAPLIMIT/2
transactions old, so we needn't keep track of the high-order bits.

2. To keep the system from having to deal with XIDs that are more than
WRAPLIMIT/2 transactions old, VACUUM should "freeze" known-good old
tuples. To do this, we'll reserve a special XID, say 1, that is always
considered committed and is always less than any ordinary XID. (So the
ordered-comparison macro is really a little more complicated than I said
above. Note that there is already a reserved XID just like this in the
system, the "bootstrap" XID. We could simply use the bootstrap XID, but
it seems better to make another one.) When VACUUM finds a tuple that
is committed good and has xmin < XmaxRecent (the oldest XID that might
be considered uncommitted by any open transaction), it will replace that
tuple's xmin by the special always-good XID. Therefore, as long as
VACUUM is run on all tables in the installation more often than once per
WRAPLIMIT/2 transactions, there will be no tuples with ordinary XIDs
older than WRAPLIMIT/2.

3. At wraparound, the XID counter has to be advanced to skip over the
InvalidXID value (zero) and the reserved XIDs, so that no real transaction
is generated with those XIDs. No biggie here.

4. With the wraparound behavior, pg_log will have a bounded size: it
will never exceed WRAPLIMIT*2 bits = WRAPLIMIT/4 bytes. Since we will
recycle pg_log entries every WRAPLIMIT xacts, during transaction start
the xact manager will have to take care to actively clear its pg_log
entry to zeroes (I'm not sure if it does that already, or just assumes
that new pg_log entries will start out zero). As long as that happens
before the xact makes any data changes, it's OK to recycle the entry.
Note we are assuming that no tuples will remain in the database with
xmin or xmax equal to that XID from a prior cycle of the universe.
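
For illustration, the four steps above could be sketched in C roughly as
follows. This is only a sketch under stated assumptions, not actual backend
code: the type name Xid, the macro names, the function names, the choice of
0 and 1 as the only reserved values, and the pg_log layout (four 2-bit
entries per byte, lowest bits first) are all made up for the example.
WRAPLIMIT is set to 1G and is assumed to be a power of two so that unsigned
subtraction wraps consistently. Equal XIDs are treated as not less than
each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;                     /* hypothetical name for a 4-byte XID */

#define WRAPLIMIT        ((Xid) 1 << 30)  /* 1G; assumed to be a power of two */
#define INVALID_XID      ((Xid) 0)
#define FROZEN_XID       ((Xid) 1)        /* assumed "always committed" XID */
#define FIRST_NORMAL_XID ((Xid) 2)        /* first XID handed to real xacts */

/* Step 1: ordered comparison that tolerates wraparound. */
bool
xid_precedes(Xid x, Xid y)
{
    assert(x != INVALID_XID && y != INVALID_XID);
    assert(x < WRAPLIMIT && y < WRAPLIMIT);

    if (x == FROZEN_XID)
        return y != FROZEN_XID;   /* frozen is older than any ordinary XID */
    if (y == FROZEN_XID)
        return false;

    /* Unsigned subtraction wraps mod 2^32, which is a multiple of WRAPLIMIT. */
    Xid diff = (Xid) (y - x) % WRAPLIMIT;

    return diff != 0 && diff < WRAPLIMIT / 2;
}

/* Step 2: what VACUUM would store as xmin for a tuple it has examined. */
Xid
vacuum_freeze_xmin(Xid xmin, bool xmin_committed, Xid xmax_recent)
{
    if (xmin >= FIRST_NORMAL_XID && xmin_committed &&
        xid_precedes(xmin, xmax_recent))
        return FROZEN_XID;
    return xmin;
}

/* Step 3: advance the counter, skipping the reserved XIDs at wraparound. */
Xid
xid_advance(Xid current)
{
    Xid next = (current + 1) % WRAPLIMIT;

    if (next < FIRST_NORMAL_XID)
        next = FIRST_NORMAL_XID;
    return next;
}

/* Step 4: zero the 2-bit pg_log status of an XID that is being recycled. */
void
pg_log_clear_entry(uint8_t *log, Xid xid)
{
    unsigned shift = (xid % 4) * 2;

    log[xid / 4] &= (uint8_t) ~(3u << shift);
}
```

With these definitions an XID just below WRAPLIMIT compares as older than a
small XID issued just after wraparound, which is the behavior the scheme
depends on.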

This scheme allows us to survive XID wraparound at the cost of slight
additional complexity in ordered comparisons of XIDs (which is not a
really performance-critical task AFAIK), and at the cost that the
original insertion XIDs of all but recent tuples will be lost by
VACUUM. The system doesn't particularly care about that, but old XIDs
do sometimes come in handy for debugging purposes. A possible
compromise is to overwrite only XIDs that are older than, say,
WRAPLIMIT/4 instead of doing so as soon as possible. This would mean
the required VACUUM frequency is every WRAPLIMIT/4 xacts instead of
every WRAPLIMIT/2 xacts.

We have a straightforward tradeoff between the maximum size of pg_log
(WRAPLIMIT/4 bytes) and the required frequency of VACUUM (at least
every WRAPLIMIT/2 or WRAPLIMIT/4 transactions). This could be made
configurable in config.h for those who're intent on customization,
but I'd be inclined to set the default value at WRAPLIMIT = 1G.
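
To make the tradeoff concrete, the arithmetic can be written out as a small
C sketch. The function names are hypothetical and exist only for this
example; the only inputs taken from the text above are the 2 bits per
transaction in pg_log and the WRAPLIMIT/2 or WRAPLIMIT/4 VACUUM interval.

```c
#include <stdint.h>

/* Upper bound on pg_log size: 2 status bits per XID, 8 bits per byte. */
uint64_t
pg_log_max_bytes(uint64_t wraplimit)
{
    return wraplimit * 2 / 8;
}

/*
 * Maximum number of transactions between VACUUMs of any one table.
 * divisor = 2: freeze old tuples as soon as possible.
 * divisor = 4: keep original XIDs around longer for debugging.
 */
uint64_t
vacuum_interval_xacts(uint64_t wraplimit, unsigned divisor)
{
    return wraplimit / divisor;
}
```

For WRAPLIMIT = 1G (2^30) this gives a pg_log of at most 268435456 bytes
(256MB), with VACUUM needed at least every 536870912 transactions, or every
268435456 transactions under the WRAPLIMIT/4 compromise.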

Comments? Vadim, is any of this about to be superseded by WAL?
If not, I'd like to fix it for 7.1.

regards, tom lane