Re: Lowering the default wal_blocksize to 4K

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Lowering the default wal_blocksize to 4K
Date: 2023-10-12 13:46:53
Message-ID: CA+TgmoavR=SSd6ddqUi8WcFgmdz=Ue41dC6YjRWH=LENceqiTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 11, 2023 at 6:11 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think the question is what the point of the crosschecks in long page headers
> is. It's pretty easy to see what the point of the xlp_sysid check is - make it
> less likely to accidentally replay WAL from a different system. It's much
> less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
> they are also in ControlFileData and the xlp_sysid check tied the control file
> and WAL file together.

Yeah, fair.

> Let me rephrase my point:
>
> If somebody uses a modified pg_resetwal to change the xlog block size, then
> tries to replay WAL from before that change, and is unlucky enough that the
> LSN looked for in a segment is the start of a valid record both before/after
> the pg_resetwal invocation, then yes, we might not catch that anymore if we
> remove the block size check. But the much much more common case is that the
> block size was *not* changed, in which case we *already* don't catch that
> pg_resetwal was invoked.

Hmm. Should we invent a mechanism just for that?

> ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
> a belt and suspenders check that is very unlikely to ever catch a mistake that
> wouldn't otherwise be caught.

I think that's probably right.

> I think that's what Thomas was proposing. Thinking about it a bit more I'm
> not sure that having the data both in the checkpoint record itself and in
> XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

Yes. To me, having it in the redo record seems considerably more
valuable. Because that's where we're going to begin replay, so we
should catch most problems straight off. To escape detection at that
point, you need to not just be pointed at the wrong WAL archive, but
actually have files of diverse origin mixed together in the same WAL
archive. That's a less-likely error, and we still have some ways of
catching it if it happens.

> Which would make the code more efficient...

Right.

> As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
> anything, but that the protection by xlp_sysid is a bit more meaningful. So a
> compromise position could be to include xlp_sysid in the page header, possibly
> in a "chopped up" manner, as Matthias suggested.

I'm not that keen on the idea of storing the upper half and lower half
in alternate pages. That seems to me to add code complexity and
cognitive burden with little increased likelihood of catching real
problems. I'm not completely opposed to the idea if somebody wants to
make it happen, but I bet it would be better to either store the whole
thing or just cut it in half and store, say, the low-order bits.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Sergei Glukhov 2023-10-12 13:47:03 Re: Problem, partition pruning for prepared statement with IS NULL clause.
Previous Message Daniel Gustafsson 2023-10-12 13:40:00 Re: [RFC] Add jit deform_counter