Re: Cost of XLogInsert CRC calculations

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk>, "'Manfred Koizar'" <mkoi-pg(at)aon(dot)at>, "'Bruce Momjian'" <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost of XLogInsert CRC calculations
Date: 2005-05-31 16:27:18
Message-ID: 13044.1117556838@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> Is the random WAL data really the concern? It seems like a more reliable way
>>> of dealing with that would be to just accompany every WAL entry with a
>>> sequential id and stop when the next id isn't the correct one.
>>
>> We do that, too (the xl_prev links and page header addresses serve that
>> purpose). But it's not sufficient given that WAL records can span pages
>> and therefore may be incompletely written.

Actually, on reviewing the code I notice two missed bets here.

1. During WAL replay, we aren't actually verifying that xl_prev matches
the address of the prior WAL record. This means we are depending only
on the page header addresses to make sure we don't replay stale WAL data
left over from the previous cycle of use of the physical WAL file. This
is fairly dangerous, considering the likelihood of partial write of a
WAL page during a power failure: the first 512-byte sector(s) of a page
may have been updated but not the rest. If an old WAL record happens to
start at exactly the sector boundary then we lose.

2. We store a separate CRC for each backup block attached to a WAL
record. Therefore the same torn-page problem could hit us if a stale
backup block starts exactly at a intrapage sector boundary --- there is
nothing guaranteeing that the backup block really goes with the WAL
record.

#1 seems like a pretty critical, but also easily fixed, bug. To fix #2
I suppose we'd have to modify the WAL format to store just one CRC
covering the whole of a WAL record and its attached backup block(s).

I think the reasoning behind the separate CRCs was to put a limit on
the amount of data guarded by one CRC, and thereby minimize the risk
of undetected errors. But using the CRCs in this way is failing to
guard against exactly the problem that we want the CRCs to guard against
in the first place, namely torn WAL records ... so worrying about
detection reliability seems misplaced.

The odds of an actual failure from case #2 are fortunately not high,
since a backup block will necessarily span across at least one WAL page
boundary and so we should be able to detect stale data by noting that
the next page's header address is wrong. (If it's not wrong, then at
least the first sector of the next page is up-to-date, so if there is
any tearing the CRC should tell us.) Therefore I don't feel any need
to try to work out a back-patchable solution for #2. But I do think we
ought to change the WAL format going forward to compute just one CRC
across a WAL record and all attached backup blocks. There was talk of
allowing compression of backup blocks, and if we do that we could no
longer feel any certainty that a page crossing would occur.

Thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc G. Fournier 2005-05-31 17:41:14 Deadlock with tsearch2 index ...
Previous Message Greg Stark 2005-05-31 16:02:12 Re: Cost of XLogInsert CRC calculations