Re: problems with making relfilenodes 56-bits

From: Andres Freund <andres(at)anarazel(dot)de>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: problems with making relfilenodes 56-bits
Date: 2022-10-04 15:34:09
Message-ID: 20221004153409.ukahqfpudxvbiut4@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote:
> On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres(at)anarazel(dot)de> wrote:
> > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> > > > only store a 2 byte offset from the page header xl_prev within each record.
> > >
> > > With that small xl_prev we may not detect partial page writes in
> > > recycled segments; or other issues in the underlying file system. With
> > > small record sizes, the chance of returning incorrect data would be
> > > significant for small records (it would be approximately the chance of
> > > getting a record boundary on the underlying page boundary * chance of
> > > getting the same MAXALIGN-adjusted size record before the persistence
> > > boundary). That issue is part of the reason why my proposed change
> > > upthread still contains the full xl_prev.
> >
> > What exactly is the theory for this significant increase? I don't think
> > xl_prev provides a meaningful protection against torn pages in the first
> > place?
>
> XLog pages don't have checksums, so they do not provide torn page
> protection capabilities on their own.
> A singular xlog record is protected against torn page writes through
> the checksum that covers the whole record - if only part of the record
> was written, we can detect that through the mismatching checksum.
> However, if records end at the tear boundary, we must know for certain
> that any record that starts after the tear is the record that was
> written after the one before the tear. Page-local references/offsets
> would not work, because the record decoding doesn't know which xlog
> page the record should be located on; it could be both the version of
> the page before it was recycled, or the one after.
> Currently, we can detect this because the value of xl_prev will point
> to a record far in the past (i.e. not the expected value), but with a
> page-local version of xl_prev we would be less likely to detect torn
> pages (and thus be unable to handle this without risk of corruption)
> due to the significant chance of the truncated xl_prev value being the
> same in both the old and new record.

Think this is addressable, see below.

> Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> recycled and receives a new record C at the place of A with the same
> length.
>
> With your proposal, record B would still be a valid record when it
> follows C; as the page-local serial number/offset reference to the
> previous record would still match after the torn write.
> With the current situation and a full LSN in xl_prev, the mismatching
> value in the xl_prev pointer allows us to detect this torn page write
> and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.

The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.

> PS. there are ideas floating around (I heard about this one from
> Heikki) where we could concatenate WAL records into one combined
> record that has only one shared xl_prev+crc; which would save these 12
> bytes per record. However, that needs a lot of careful consideration
> to make sure that the persistence guarantee of operations doesn't get
> lost somewhere in the traffic.

One version of that is to move the CRCs to the page header, make the pages
smaller (512 bytes / 4K, depending on the hardware), and to pad out partial
pages when flushing them out. Rewriting pages is bad for hardware and prevents
having multiple WAL IOs in flight at the same time.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2022-10-04 15:45:32 Allow tests to pass in OpenSSL FIPS mode
Previous Message Tom Lane 2022-10-04 13:29:36 Re: get rid of Abs()