WAL Recovery Bug in 7.2.3

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: WAL Recovery Bug in 7.2.3
Date: 2003-01-16 21:55:26
Message-ID: 200301161355.26027.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


Affects: 7.2.3 (possibly down to 7.1.0)
Frequency: Very Rare
Effect When Occurring: Database corruption
Difficulty of Fix: Trivial
Certianty of Diagnosis: about 50% according to Tom

This bug was analyzed by Tom Lane; I'm just writing it up.

In 7.2.3 (and possibly in earlier versions) these two rows in xlog.c are out
of order:
FlushBufferPool();
CheckPointCLOG();

Per Tom's analysis:
"7.2.* does checkpoint operations in the wrong order: CreateCheckPoint
does
FlushBufferPool();
CheckPointCLOG();
... create and write checkpoint WAL record ...
The reason this is the wrong order is that CheckPointCLOG() only issues
write()s of dirty pg_clog pages; it does not fsync them. Thus, it is
possible that the checkpoint WAL record will be flushed to disk while
the clog page writes are still hanging about in kernel disk cache.
If there is a system crash before the kernel gets around to sync'ing
the dirty clog pages to disk, then we lose --- on restart, the WAL logic
will only replay WAL entries after the latest checkpoint, and so any
transaction commits occurring before the checkpoint would fail to be
re-marked in pg_clog."

As an error scenario, this seems rather farfetched; Postgres would have to be
killed, a second time, while in recovery mode at a moment between
FlushBufferPool() and CheckPointCLOG(). A remote enough possibility to
ignore. Except that it seems to have happened twice, to two different
users.

The scenario under which this bug becomes critical is this:

1) In the middle of a large UPDATE statement, the Postgres server loses power
from a general power outage or local building short.
2) This server is not buffered by a UPS.
3) Due to work on the power system or weather damage, power comes back on,
then off after a few minutes, cycling off-and-on 4-5 times (this is not
farfetched; during the California "power crisis" I saw it happen several
times).
4) This has the possible effect of repeatedly downing Postgres while it is in
recovery mode.
5) Sooner or later, the up-down effect "gets lucky" and postgres goes down
while FlushBufferCache() is finishing up.
6) The user ends up with two versions of one or more of their records marked
as valid by Postgres. Per Tom's analysis of one such problem:

"Well, here's what I've found so far. The two tuples in question have
header data like so (as printed by pg_filedump):

Item 28 -- Length: 248 Offset: 7944 (0x1f08) Flags: USED
OID: 487894 CID: min(0) max(0) XID: min(9776912) max(17920315)
Block Id: 4664 linp Index: 1 Attributes: 31 Size: 36
infomask: 0x0903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID)
t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d

Item 2 -- Length: 248 Offset: 7944 (0x1f08) Flags: USED
OID: 487894 CID: min(0) max(0) XID: min(9777615) max(10180711)
Block Id: 4666 linp Index: 1 Attributes: 31 Size: 36
infomask: 0x2903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID|UPDATED)
t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d

What appears to have happened is this: transaction 9776912 created the
row initially (the first of the two items is evidently the first
incarnation of the row, since it does not have HEAP_UPDATED set). A
little while later, transaction 9777615 updated the row, creating the
second tuple. Our problem is that both tuples appear to be committed
good --- both have XMIN_COMMITTED set.

Digging into the pg_clog data, I find that 9776912 is shown as
"committed", as expected. But 9777615 is shown as "in progress" --- the
clog entry has not been marked as either committed or aborted!"

Since this bug has been fixed in 7.3.1, it's not cirtical to release a patch.
HOWEVER, given the triviality of the fix ... simply swapping those two lines
in xlog.c ... does everyone think it would be a good idea to post a notice on
the lists of the issue and the fix?

While it easy enough to tell users, "Upgrade, or get a UPS" this is not
practical for everyone.

And is there any potential issue with swapping those two lines?

--
-Josh Berkus
Aglio Database Solutions
San Francisco

Browse pgsql-bugs by date

  From Date Subject
Next Message bigapple 2003-01-17 05:54:10 permission leak
Previous Message Sharanayya G M 2003-01-16 12:58:58 Unable to Install PostGreSQL on Red Hat Linux release 7.2