Consider the following scenario:
1. A new transaction inserts a tuple. The tuple is entered into its
heap file with the new transaction's XID, and an associated WAL log
entry is made. Neither of these is on disk yet --- the heap tuple
is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
2. Now do a lot of read-only operations, in the same or another backend.
The WAL log stays where it is, but eventually the shmem disk buffer will
get flushed to disk so that the buffer can be re-used for some other page.
3. Assume we now crash. Now, we have a heap tuple on disk with an XID
that does not correspond to any XID visible in the on-disk WAL log.
4. Upon restart, WAL will initialize the XID counter to the first XID
not seen in the WAL log. Guess which one that is.
5. We will now run a new transaction with the same XID that was in use
before the crash. If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.
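The five steps above can be replayed in a toy model. This is only an illustrative sketch, not PostgreSQL code: the ToyDisk struct, the toy_* functions, and the XID values 100 and 101 are all made up for the example, and visibility is reduced to "the inserting XID is marked committed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Hypothetical model of the state that survives a crash. */
typedef struct {
    ToyXid max_xid_in_wal;   /* highest XID mentioned in the on-disk WAL */
    ToyXid heap_tuple_xmin;  /* inserting XID of the one tuple on disk, 0 = none */
    ToyXid committed_xid;    /* XID recorded as committed, 0 = none */
} ToyDisk;

/* Step 4: restart picks the first XID not seen in the on-disk WAL. */
static ToyXid toy_recover_next_xid(const ToyDisk *disk)
{
    return disk->max_xid_in_wal + 1;
}

/* Simplified visibility rule: the tuple counts as valid if its
 * inserting XID is marked committed. */
static bool toy_tuple_visible(const ToyDisk *disk)
{
    return disk->heap_tuple_xmin != 0 &&
           disk->heap_tuple_xmin == disk->committed_xid;
}

/* Replays steps 1-5; returns true if the orphaned tuple ends up visible. */
static bool toy_replay_scenario(void)
{
    ToyDisk disk = { .max_xid_in_wal = 100, .heap_tuple_xmin = 0,
                     .committed_xid = 0 };

    /* Step 1: XID 101 is handed out; tuple and WAL entry exist only in memory. */
    ToyXid crashed_xid = 101;

    /* Step 2: the heap buffer is flushed, the WAL buffer is not. */
    disk.heap_tuple_xmin = crashed_xid;

    /* Step 3: crash. The in-memory WAL entry is lost.
     * Step 4: restart recomputes the XID counter from the on-disk WAL. */
    ToyXid reused = toy_recover_next_xid(&disk);
    assert(reused == crashed_xid);   /* the same XID is handed out again */

    /* Step 5: a new transaction runs with the reused XID and commits. */
    disk.max_xid_in_wal = reused;
    disk.committed_xid = reused;

    /* The tuple left behind by the crashed transaction now looks committed. */
    return toy_tuple_visible(&disk);
}
```

Calling toy_replay_scenario() returns true in this model, which is exactly the bad outcome: the crashed transaction's tuple is treated as valid.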
After thinking about this for a little, it seems to me that XID
assignment should be handled more like OID assignment: rather than
handing out XIDs one-at-a-time, varsup.c should allocate them in blocks,
and should write an XLOG record to reflect the allocation of each block
of XIDs. Furthermore, the above example demonstrates that *we must
flush that XLOG entry to disk* before we can start to actually hand out
the XIDs. This ensures that the next system cycle won't re-use any XIDs
that may have been in use at the time of a crash.
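A minimal sketch of that block-allocation idea follows, under stated assumptions. The ToyXidState struct, the toy_* functions, and the block size of 1024 are hypothetical; the "flush" is just a field update standing in for writing and flushing a NEXTXID record. XID wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;

#define TOY_XID_PREFETCH 1024   /* block size; an arbitrary choice for the sketch */

typedef struct {
    ToyXid next_xid;       /* next XID to hand out (in memory only) */
    ToyXid logged_limit;   /* XIDs below this are covered by a flushed record */
    ToyXid durable_limit;  /* stand-in for the NEXTXID record on disk */
    int    flush_count;    /* how many times we had to flush */
} ToyXidState;

/* Stand-in for "write a NEXTXID XLOG record, then flush it". */
static void toy_log_and_flush_nextxid(ToyXidState *st, ToyXid limit)
{
    st->durable_limit = limit;
    st->flush_count++;
}

/* Hand out one XID; reserve and durably log a new block when the
 * current one is exhausted. */
static ToyXid toy_get_new_xid(ToyXidState *st)
{
    if (st->next_xid >= st->logged_limit) {
        ToyXid new_limit = st->next_xid + TOY_XID_PREFETCH;

        /* The record must be on disk BEFORE any XID in the block is used. */
        toy_log_and_flush_nextxid(st, new_limit);
        st->logged_limit = new_limit;
    }
    /* Invariant: every XID handed out lies below the durable limit. */
    assert(st->next_xid < st->durable_limit);
    return st->next_xid++;
}

/* After a crash, restart from the durable limit, skipping every XID
 * that could possibly have been handed out before the crash. */
static ToyXid toy_restart_next_xid(const ToyXidState *st)
{
    return st->durable_limit;
}
```

With this arrangement only one flush is paid per block of XIDs, and a restart may skip some unused XIDs but never re-issues one.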
OID assignment is not quite so critical. Consider again the scenario
above: we don't really care if after restart we reuse the OID that was
assigned to the crashed transaction's inserted tuple. As long as the
tuple itself is not considered committed, it doesn't matter what OID it
contains. So, it's not necessary to force XLOG flush for OID-assignment records.
In short then: make the XID allocation machinery just like the OID
allocation machinery presently is, plus an XLogFlush() after writing
the NEXTXID XLOG record.
regards, tom lane
PS: oh, another thing: redo of a checkpoint record ought to advance the
XID and OID counters to be at least what the checkpoint record shows.
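The "at least" rule in the PS amounts to taking a maximum during redo. A rough sketch, with a hypothetical ToyCounters struct and toy_redo_checkpoint function, and plain unsigned comparison that ignores counter wraparound:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of counters, as carried in a checkpoint record
 * and as held in memory during recovery. */
typedef struct {
    uint32_t next_xid;
    uint32_t next_oid;
} ToyCounters;

static uint32_t toy_max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* Redo of a checkpoint record: never move a counter backwards,
 * only advance it to at least what the record shows. */
static void toy_redo_checkpoint(ToyCounters *cur, const ToyCounters *ckpt)
{
    assert(cur && ckpt);
    cur->next_xid = toy_max_u32(cur->next_xid, ckpt->next_xid);
    cur->next_oid = toy_max_u32(cur->next_oid, ckpt->next_oid);
}
```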