Re: problems with making relfilenodes 56-bits

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: problems with making relfilenodes 56-bits
Date: 2022-10-04 18:53:34
Message-ID: CA+TgmoYP=w5j3TCdmsZt=HtQMBe+E2X58YkZGvk+vvCdGStmWw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Consider the following sequence:
>
> 1) we write WAL like this:
>
> [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]
>
> 2) crash, the sectors with A and C made it to disk, the one with B didn't
>
> 3) We replay A, discover B is invalid (possibly zeroes or old contents),
> insert a new record B' with the same length. Now it looks like this:
>
> [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]
>
> 4) crash, the sector with B' makes it to disk
>
> 5) we replay A, B', C, because C has an xl_prev that's compatible with B'
> location and a valid CRC.
>
> Oops.
>
> I think this can happen both within a single page and across page boundaries.
>
> I hope I am missing something here?

If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.

> I think there might be reasonable ways to increase the guarantees based on the
> 2 byte xl_prev approach "alone". We don't have to store the offset from the
> page header as a plain offset. What about storing something like:
> page_offset ^ (page_lsn >> wal_segsz_shift)
>
> I think something like that'd result in prev_not_really_lsn typically not
> simply matching after recycling. Of course it only provides so much
> protection, given 16bits...

Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.

How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikita Malakhov 2022-10-04 19:41:22 Re: Pluggable toaster
Previous Message Andres Freund 2022-10-04 18:35:53 Re: installcheck-world concurrency issues