Re: Lowering the default wal_blocksize to 4K

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Lowering the default wal_blocksize to 4K
Date: 2023-10-11 22:11:26
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2023-10-11 16:05:02 -0400, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
> I'm extremely confused about what both of you are saying.
> Matthias is referring to pg_reset_wal, which I assume means
> pg_resetwal. But it has no option to change the WAL block size. It
> does have an option to change the WAL segment size, but that's not the
> same thing. And even if pg_resetwal did have an option to change the
> WAL segment size, it removes all WAL from pg_wal when it runs, so you
> wouldn't normally end up trying to replay WAL from before the
> operation because it would have been removed. You might still have
> those files around in an archive or something, but the primary doesn't
> replay from the archive. You might have standbys, but I would assume
> they would have to be rebuilt after changing the WAL block size on the
> master, unless you were trying to follow some probably-too-clever
> procedure to avoid a standby rebuild. So I'm really kind of lost as to
> what the scenario is that Matthias has in mind.

I think the question is what the point of the crosschecks in long page headers
is. It's pretty easy to see what the point of the xlp_sysid check is - make it
less likely to accidentally replay WAL from a different system. It's much
less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
they are also in ControlFileData and the xlp_sysid check tied the control file
and WAL file together.

> But Andres's response doesn't make any sense to me either. What in the world
> does "if they don't change the xlog block size, we'll just accept random WAL
> as well" mean? Neither having or not having a check that the block size
> hasn't change causes us to "just accept random WAL". To "accept random WAL,"
> we'd have to remove all of the sanity checks, which nobody is proposing and
> nobody would accept.

Let me rephrase my point:

If somebody uses a modified pg_resetwal to change the xlog block size, then
tries to replay WAL from before that change, and is unlucky enough that the
LSN looked for in a segment is the start of a valid record both before/after
the pg_resetwal invocation, then yes, we might not catch that anymore if we
remove the block size check. But the much much more common case is that the
block size was *not* changed, in which case we *already* don't catch that
pg_resetwal was invoked.

ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
a belt and suspenders check that is very unlikely to ever catch a mistake that
wouldn't otherwise be caught.

> But if we do want to keep those cross-checks, why not take what Thomas
> proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO?

I think that's what Thomas was proposing. Thinking about it a bit more I'm
not sure that having the data both in the checkpoint record itself and in
XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

> Then long and short page headers would become identical.

Which would make the code more efficient...

> We'd lose the ability to recheck those values for every new segment, but it
> seems quite unlikely that any of these values would change in the middle of
> replay.

I guess the most likely scenario would be a replica that has some local WAL
files (e.g. due to pg_basebackup -X ...), but accidentally configures a
restore_command pointing to the wrong archive. In that case recovery could
start up successfully as the checkpoint/redo records have sensible contents,
but we then wouldn't necessarily notice that the subsequent files aren't from
the correct system. Of course you need to be quite unlucky and have LSNs that
match between the systems. The most likely path I can think of is an idle
system with archive_timeout.

As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
anything, but that the protection by xlp_sysid is a bit more meaningful. So a
compromise position could be to include xlp_sysid in the page header, possibly
in a "chopped up" manner, as Matthias suggested.

> If they did, would xl_prev and xl_crc be sufficient to catch that? I think
> Andres says in a later email that they would be, and I think I'm inclined to
> agree. False xl_prev matches don't seem especially unlikely, but xl_crc
> seems like it should be effective.

I think it'd be an entirely tolerable risk. Even if a WAL file were falsely
replayed, we'd still notice the problem soon after. I think once such a
mistake was made, it'd be inadvasible to continue using the cluster anyway.


Andres Freund

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-10-11 22:16:33 Re: Lowering the default wal_blocksize to 4K
Previous Message Thomas Munro 2023-10-11 22:10:41 Re: The danger of deleting backup_label