Re: Lowering the default wal_blocksize to 4K

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Lowering the default wal_blocksize to 4K
Date: 2023-10-11 02:47:44
Message-ID: 20231011024744.hyqhahep6lpvv4pp@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-10-11 14:39:12 +1300, Thomas Munro wrote:
> On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > > the differing space is worse than the gain. Or do something in between - having
> > > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > > convinced that having segment and block size in there, as 32bit numbers no
> > > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > > the xlog block or segment size differ.
> > >
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
>
> How about this idea: Put the system ID etc into the new record Robert
> is proposing for the redo point, and also into the checkpoint record,
> so that it's at both ends of the to-be-replayed range.

I think that's a very good idea.

> That just leaves the WAL segments in between. If you find yourself writing
> a new record that would go in the first usable byte of a segment, insert a
> new special system ID (etc) record that will be checked during replay.

I don't see how we can do that without incurring a lot of overhead though. This
determination would need to happen in ReserveXLogInsertLocation(), while
holding the spinlock. Which is one of the most contended bits of code in
postgres. The whole reason that we have this "byte pos" to LSN conversion
stuff is to make the spinlock-protected part of ReserveXLogInsertLocation() as
short as possible.
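
For readers unfamiliar with the "byte pos" to LSN conversion, here is a minimal
standalone sketch of the idea. It is not the actual xlog.c code: the function
names are hypothetical, and the constants (8K WAL blocks, 16MB segments, 24-byte
short and 40-byte long page headers) are assumptions for illustration. A "byte
position" counts only usable bytes, so the spinlock-protected part can advance
it with plain addition, and the header accounting happens afterwards. The second
function shows how the conversion would simplify if every page had the same
header size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define WAL_BLCKSZ  8192u                   /* WAL block (page) size */
#define WAL_SEGSZ   (16u * 1024u * 1024u)   /* WAL segment size */
#define SHORT_PHD   24u                     /* short page header size */
#define LONG_PHD    40u                     /* long header, first page of a segment */

#define USABLE_IN_PAGE (WAL_BLCKSZ - SHORT_PHD)
#define USABLE_IN_SEGMENT \
    ((uint64_t) (WAL_SEGSZ / WAL_BLCKSZ) * USABLE_IN_PAGE - (LONG_PHD - SHORT_PHD))

/* Convert a usable-byte position to an LSN with long + short page headers. */
uint64_t
bytepos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / USABLE_IN_SEGMENT;
    uint64_t bytesleft = bytepos % USABLE_IN_SEGMENT;
    uint64_t seg_offset;

    if (bytesleft < WAL_BLCKSZ - LONG_PHD)
    {
        /* Lands on the first page of the segment, after the long header. */
        seg_offset = bytesleft + LONG_PHD;
    }
    else
    {
        /* Skip the first page, then count pages with short headers. */
        bytesleft -= WAL_BLCKSZ - LONG_PHD;
        seg_offset = WAL_BLCKSZ
            + (bytesleft / USABLE_IN_PAGE) * WAL_BLCKSZ
            + (bytesleft % USABLE_IN_PAGE) + SHORT_PHD;
    }
    return fullsegs * WAL_SEGSZ + seg_offset;
}

/* Same conversion if every page carried one uniform header of size phd. */
uint64_t
bytepos_to_lsn_uniform(uint64_t bytepos, uint32_t phd)
{
    uint64_t usable;

    assert(phd < WAL_BLCKSZ);
    usable = WAL_BLCKSZ - phd;
    /* Segment size is a multiple of the block size, so no per-segment case. */
    return (bytepos / usable) * WAL_BLCKSZ + (bytepos % usable) + phd;
}
```

With these assumed sizes, byte position 0 maps to the first byte after the long
header of segment 0, and the uniform variant needs neither the per-segment
division nor the first-page branch.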

> For segments that start with XLP_FIRST_IS_CONTRECORD, don't worry about it:
> those already form part of a chain of verification (xlp_rem_len, xl_crc)
> that started on the preceding page, so it seems almost impossible to
> accidentally replay from a segment that came from another system.

But I think we might just be ok with logic similar to this, even for the
non-contrecord case. If recovery starts in one segment where we have verified
sysid, xlog block size etc and we encounter a WAL record starting on the first
"content byte" of a segment, we can still verify that the prev LSN is correct
etc. Sure, if you try hard you could come up with a scenario where you could
mislead such a check, but we don't need to protect against intentional malice
here, just against accidents.
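
A rough sketch of what such a check could look like, under stated assumptions:
the struct and function names below are hypothetical and heavily simplified
(a real record header carries more than a prev pointer), and the segment size
and long header size are assumed values. The point is only that a record
starting on the first content byte of a segment still carries a prev LSN that
has to match the record replayed just before it, which an unrelated segment is
unlikely to satisfy by accident.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define WAL_SEGSZ  (16u * 1024u * 1024u)   /* WAL segment size */
#define LONG_PHD   40u                     /* header size on a segment's first page */

/* Hypothetical, simplified view of a WAL record for this sketch. */
typedef struct SketchRecord
{
    uint64_t lsn;       /* where this record starts */
    uint64_t xl_prev;   /* LSN of the preceding record, from its header */
} SketchRecord;

/* True if the LSN is the first byte after the page header of a segment. */
bool
starts_on_first_content_byte(uint64_t lsn)
{
    return lsn % WAL_SEGSZ == LONG_PHD;
}

/* True if the record's back-link matches the record replayed before it. */
bool
prev_link_ok(const SketchRecord *rec, uint64_t last_replayed_lsn)
{
    assert(rec);
    return rec->xl_prev == last_replayed_lsn && rec->xl_prev < rec->lsn;
}
```

Replay would call prev_link_ok() for the record at the segment boundary and
refuse to continue on a mismatch; this guards against accidents, not against a
deliberately crafted segment.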

Greetings,

Andres Freund
