Re: Cost of XLogInsert CRC calculations

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Mark Cave-Ayland <m(dot)cave-ayland(at)webbased(dot)co(dot)uk>, 'Manfred Koizar' <mkoi-pg(at)aon(dot)at>, 'Bruce Momjian' <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost of XLogInsert CRC calculations
Date: 2005-05-31 23:19:12
Message-ID: 1117581552.3844.797.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2005-05-31 at 12:27 -0400, Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >>> Is the random WAL data really the concern? It seems like a more reliable way
> >>> of dealing with that would be to just accompany every WAL entry with a
> >>> sequential id and stop when the next id isn't the correct one.
> >>
> >> We do that, too (the xl_prev links and page header addresses serve that
> >> purpose). But it's not sufficient given that WAL records can span pages
> >> and therefore may be incompletely written.
>
> Actually, on reviewing the code I notice two missed bets here.
>
> 1. During WAL replay, we aren't actually verifying that xl_prev matches
> the address of the prior WAL record. This means we are depending only
> on the page header addresses to make sure we don't replay stale WAL data
> left over from the previous cycle of use of the physical WAL file. This
> is fairly dangerous, considering the likelihood of partial write of a
> WAL page during a power failure: the first 512-byte sector(s) of a page
> may have been updated but not the rest. If an old WAL record happens to
> start at exactly the sector boundary then we lose.

Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
used, but passed that by. Damn.

> 2. We store a separate CRC for each backup block attached to a WAL
> record. Therefore the same torn-page problem could hit us if a stale
> backup block starts exactly at a intrapage sector boundary --- there is
> nothing guaranteeing that the backup block really goes with the WAL
> record.
>
> #1 seems like a pretty critical, but also easily fixed, bug. To fix #2
> I suppose we'd have to modify the WAL format to store just one CRC
> covering the whole of a WAL record and its attached backup block(s).
>
> I think the reasoning behind the separate CRCs was to put a limit on
> the amount of data guarded by one CRC, and thereby minimize the risk
> of undetected errors. But using the CRCs in this way is failing to
> guard against exactly the problem that we want the CRCs to guard against
> in the first place, namely torn WAL records ... so worrying about
> detection reliability seems misplaced.
>
> The odds of an actual failure from case #2 are fortunately not high,
> since a backup block will necessarily span across at least one WAL page
> boundary and so we should be able to detect stale data by noting that
> the next page's header address is wrong. (If it's not wrong, then at
> least the first sector of the next page is up-to-date, so if there is
> any tearing the CRC should tell us.) Therefore I don't feel any need
> to try to work out a back-patchable solution for #2. But I do think we
> ought to change the WAL format going forward to compute just one CRC
> across a WAL record and all attached backup blocks. There was talk of
> allowing compression of backup blocks, and if we do that we could no
> longer feel any certainty that a page crossing would occur.
>
> Thoughts?

PreAllocXLog was already a reason to have somebody prepare new xlog
files ahead of them being used. Surely the right solution here is to
have that agent prepare fresh/zeroed files prior to them being required.
That way no stale data can ever occur and both of these bugs go
away....

Fixing that can be backpatched so that the backend that switches files
can do the work, rather than bgwriter [ or ?].

Best Regards, Simon Riggs

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2005-05-31 23:32:47 Re: Quick-and-dirty compression for WAL backup blocks
Previous Message Jonah H. Harris 2005-05-31 23:06:40 Re: Tablespace-level Block Size Definitions