Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Date: 2022-11-11 10:57:14
Message-ID: CALj2ACWTQCYGHf=2pHQeX=8qfkMi-4JsXN0t_6W8+A-cW8wJpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 10, 2022 at 9:52 PM David Christensen
<david(dot)christensen(at)crunchydata(dot)com> wrote:
>
> > > > 2. I'm unable to understand the use-case for --fixup-fpi option.
> > > > pg_waldump is supposed to be just WAL reader, and must not return any
> > > > modified information, with --fixup-fpi option, the patch violates this
> > > > principle i.e. it sets page LSN and returns. Without actually
> > > > replaying the WAL record on the page, how is it correct to just set
> > > > the LSN? How will it be useful? ISTM, we must ignore this option
> > > > unless there's a strong use-case.
> > >
> > > How I was envisioning this was for cases like extreme surgery for
> > > corrupted pages, where you extract the page from WAL but it has lsn
> > > and checksum set so you could do something like `dd if=fixup-block
> > > of=relation ...`, so it *simulates* the replay of said fullpage blocks
> > > in cases where for some reason you can't play the intermediate
> > > records; since this is always a fullpage block, it's capturing what
> > > would be the snapshot so you could manually insert somewhere as needed
> > > without needing to replay (say if dealing with an incomplete or
> > > corrupted WAL stream).
> >
> > Recovery sets the page LSN after it replayed the WAL record on the
> > page right? Recovery does this - base_page/FPI +
> > apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
> > new_version_of_page. Essentially, in your patch, you are just setting
> > the WAL record LSN with the page contents being the base page's. I'm
> > still not sure what's the real use-case here. We don't have an
> > independent function in postgres, given a base page and a WAL record
> > that just replays the WAL record and output's the new version of the
> > page, so I think what you do in the patch with fixup option seems
> > wrong to me.
>
> Well if it's not the same output then I guess you're right and there's
> not a use for the `--fixup` mode. By the same token, I'd say
> calculating/setting the checksum also wouldn't need to be done, we
> should just include the page as included in the WAL stream.

Let's hear from others, we may be missing something here. I recommend
keeping the --fixup patch as 0002, in case if we decide to discard
it's easier, however I'll leave that to you.

> > > > 5.
> > > > + if (!RestoreBlockImage(record, block_id, page))
> > > > + continue;
> > > > +
> > > > + /* we have our extracted FPI, let's save it now */
> > > > After extracting the page from the WAL record, do we need to perform a
> > > > checksum on it?
> >
> > I think you just need to do the following, this will ensure the
> > authenticity of the page that pg_waldump returns.
> > if ((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> > {
> > pg_fatal("page checksum failed");
> > }
>
> The WAL already has a checksum, so not certain this makes sense on its
> own. Also I'm inclined to make it a warning if it doesn't match rather
> than a fatal. (I'd also have to verify that the checksum is properly
> set on the page prior to copying the FPI into WAL, which I'm pretty
> sure it is but not certain.)

How about having it as an Assert()?

> > 5.
> > + fsync(fileno(OPF));
> > + fclose(OPF);
> > I think just the fsync() isn't enough, you still need fsync_fname()
> > and/or fsync_parent_path(), perhaps after for (block_id = 0; block_id
> > <= XLogRecMaxBlockId(record); block_id++) loop.
>
> I'm not sure I get the value of the fsyncs here; if you are using this
> tool at this capacity you're by definition doing some sort of
> transient investigative steps. Since the WAL was fsync'd, you could
> always rerun/recreate as needed in the unlikely event of an OS crash
> in the middle of this investigation. Since this is outside the
> purview of the database operations proper (unlike, say, initdb) seems
> like it's unnecessary (or definitely shouldn't need to be selectable).
> My thoughts are that if we're going to fsync, just do the fsyncs
> unconditionally rather than complicate the interface further.

-1 for fysnc() per file created as it can create a lot of sync load on
production servers impacting performance. How about just syncing the
directory at the end assuming it doesn't cost as much as fsync() per
FPI file created would?

> > 4.
> > +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> > +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> > +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> > Why do you need to set wal_level to replica, out of the box your
> > cluster comes with replica only no?
> > And why do you need archive_mode on and set the command to do nothing?
> > Why archiving is needed for testing your feature firstly?
>
> I think it had shown "minimal" in my testing; I was purposefully
> failing archives so the WAL would stick around. Maybe a custom
> archive command that just copied a single WAL file into a known
> location so I could use that instead of the current approach would
> work, though not sure how Windows support would work with that. Open
> to other ideas to more cleanly get a single WAL file that isn't the
> last one. (Earlier versions of this test were using /all/ of the
> generated WAL files rather than a single one, so maybe I am
> overcomplicating things for a single WAL file case.)

Typically we create a physical replication slot at the beginning so
that the server keeps the WAL required for you in pg_wal itself, for
instance, please see pg_walinspect:

-- Make sure checkpoints don't interfere with the test.
SELECT 'init' FROM
pg_create_physical_replication_slot('regress_pg_walinspect_slot',
true, false);

> > 2.
> > +$primary->append_conf('postgresql.conf', "max_wal_size='100MB'");
> > Do you really need 100MB max_wal_size? Why can't you just initialize
> > your cluster with 1MB wal files instead of 16MB and set max_wal_size
> > to 4MB or a bit more, something like 019_replslot_limit.pl does?
>
> Yeah, that could work; I wasn't aware that other tests were modifying
> those params. In order to get a test that wouldn't barf when it hit
> the end of the WAL stream (and so fail the test) I needed to ensure
> there were multiple WAL files generated that would not be recycled so
> I could dump a single WAL file without error. This was the approach I
> was able to come up with. :-)

If you do trick specified in the above comment i.e. using replication
slot to hold the WAL, I think you don't need to set max_wal_size at
all, if that's true, I think the tests can just spin up a node and run
the tests without bothering max_wal_size, archive_mode etc.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2022-11-11 11:28:02 A problem about join ordering
Previous Message Konstantin Knizhnik 2022-11-11 10:43:13 Re: Lack of PageSetLSN in heap_xlog_visible