Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
Date: 2018-01-12 15:45:54
Message-ID: 6772.1515771954@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> So we can't completely remove xl_prev field, without giving up some
>> functionality. But we don't really need to store the 8-byte previous
>> WAL pointer in order to detect torn pages. Something else which can
>> tell us that the WAL record does not belong to current WAL segno would
>> be enough as well. I propose that we replace it with a much smaller
>> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
>> decide to call it) is the low order 16-bits of the WAL segno to which
>> the WAL record belongs. While reading WAL, we always match that the
>> "xl_walid" value stored in the WAL record matches with the current WAL
>> segno's lower order 16-bits and if not, then consider that as the end
>> of the stream.
>>
>> For this to work, we must ensure that WAL files are either recycled in
>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>> easy to do with the existing infrastructure.

> I have some reservations about whether this makes the mechanism less
> reliable.

Yeah, it scares me too. The xl_prev field is our only way of detecting
that we're looking at old WAL data when we cross a sector boundary.
I have no faith that we can prevent old WAL data from reappearing in the
file system across an OS crash, so I find Simon's assertion that we can
dodge the problem through file manipulation to be simply unbelievable.

If we could be sure that the WAL page size was no larger than the file
system's write quantum, then checking xlp_pageaddr would be sufficient
to detect stale WAL data. But I'm afraid 8K is way too big for that;
so we need to be able to recognize page tears within-page.

> Of course, we also have xl_crc, so I'm not sure whether there's any
> chance of real harm...

The CRC only tells you that you have a valid WAL record; it won't clue
you in that it's old data you shouldn't replay. If the previous WAL
record crossed the torn-page boundary, then you should have gotten a
CRC failure on that record --- but if the previous record ended at a
sector boundary, recognizing that the new record has an old xl_prev is
our ONLY defense against replaying stale data.

regards, tom lane
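
The stale-record case described above can be illustrated with a small sketch. This is not PostgreSQL code: the record layout, the helper names (make_record, crc_ok, accept) and the LSN values are all hypothetical, and zlib's CRC-32 stands in for the CRC-32C that PostgreSQL actually uses. It only shows why a record left over from a previous use of a recycled segment passes a CRC check but is rejected once its xl_prev is compared against the position of the record the reader just finished.

```python
import struct
import zlib

# Hypothetical, simplified record layout (NOT the real XLogRecord):
#   xl_prev (8 bytes) | payload length (4 bytes) | payload | crc (4 bytes)
HEADER = struct.Struct("<QI")
CRC = struct.Struct("<I")


def make_record(xl_prev: int, payload: bytes) -> bytes:
    """Build a record whose CRC covers the header and the payload."""
    body = HEADER.pack(xl_prev, len(payload)) + payload
    return body + CRC.pack(zlib.crc32(body))


def crc_ok(record: bytes) -> bool:
    """True if the stored CRC matches the record contents."""
    body, stored = record[:-CRC.size], CRC.unpack(record[-CRC.size:])[0]
    return zlib.crc32(body) == stored


def accept(record: bytes, expected_prev: int) -> bool:
    """Accept a record only if the CRC is valid AND xl_prev points at
    the start of the record the reader has just processed."""
    xl_prev, _length = HEADER.unpack_from(record)
    return crc_ok(record) and xl_prev == expected_prev


# Made-up LSNs: the same file offset reached in two different uses
# of a recycled segment file.
OLD_PREV_LSN = 0x01000000 + 0x1C0   # previous use of the file
NEW_PREV_LSN = 0x05000000 + 0x1C0   # current use of the file

# A record written during the previous use. It sits just past a sector
# boundary, and that sector was never overwritten before the crash.
stale = make_record(OLD_PREV_LSN, b"old tuple data")

# What the current use of the file would have written there.
fresh = make_record(NEW_PREV_LSN, b"new tuple data")

# A record that was torn mid-write: one byte differs from what was checksummed.
torn = bytearray(fresh)
torn[HEADER.size] ^= 0xFF
torn = bytes(torn)

print(crc_ok(stale))                 # the CRC alone cannot tell it is stale
print(accept(stale, NEW_PREV_LSN))   # the xl_prev comparison rejects it
print(accept(fresh, NEW_PREV_LSN))   # a genuine next record is accepted
print(crc_ok(torn))                  # a torn record is caught by the CRC
```

In this sketch the torn record is caught by the CRC, while the intact-but-old record is caught only by the xl_prev comparison, which is the distinction drawn between the two cases above.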
