Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

From: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Date: 2022-11-10 16:22:28
Message-ID: CAOxo6XJR8OkQ7K-N9Ekq4ao+VJBPApYmP+yV4Cp+mbweHb_iLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 10, 2022 at 2:33 AM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Thu, Nov 10, 2022 at 1:31 AM David Christensen
> <david(dot)christensen(at)crunchydata(dot)com> wrote:
> >
>
> Thanks for providing the v7 patch, please see my comments and responses below.

Hi Bharath, Thanks for the feedback.

> > > 2. I'm unable to understand the use-case for --fixup-fpi option.
> > > pg_waldump is supposed to be just WAL reader, and must not return any
> > > modified information, with --fixup-fpi option, the patch violates this
> > > principle i.e. it sets page LSN and returns. Without actually
> > > replaying the WAL record on the page, how is it correct to just set
> > > the LSN? How will it be useful? ISTM, we must ignore this option
> > > unless there's a strong use-case.
> >
> > How I was envisioning this was for cases like extreme surgery for
> > corrupted pages, where you extract the page from WAL but it has lsn
> > and checksum set so you could do something like `dd if=fixup-block
> > of=relation ...`, so it *simulates* the replay of said fullpage blocks
> > in cases where for some reason you can't play the intermediate
> > records; since this is always a fullpage block, it's capturing what
> > would be the snapshot so you could manually insert somewhere as needed
> > without needing to replay (say if dealing with an incomplete or
> > corrupted WAL stream).
>
> Recovery sets the page LSN after it replayed the WAL record on the
> page right? Recovery does this - base_page/FPI +
> apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
> new_version_of_page. Essentially, in your patch, you are just setting
> the WAL record LSN with the page contents being the base page's. I'm
> still not sure what's the real use-case here. We don't have an
> independent function in postgres, given a base page and a WAL record
> that just replays the WAL record and output's the new version of the
> page, so I think what you do in the patch with fixup option seems
> wrong to me.

Well if it's not the same output then I guess you're right and there's
not a use for the `--fixup` mode. By the same token, I'd say
calculating/setting the checksum also wouldn't need to be done, we
should just include the page as included in the WAL stream.

> > > 5.
> > > + if (!RestoreBlockImage(record, block_id, page))
> > > + continue;
> > > +
> > > + /* we have our extracted FPI, let's save it now */
> > > After extracting the page from the WAL record, do we need to perform a
> > > checksum on it?
>
> I think you just need to do the following, this will ensure the
> authenticity of the page that pg_waldump returns.
> if ((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> {
> pg_fatal("page checksum failed");
> }

The WAL already has a checksum, so not certain this makes sense on its
own. Also I'm inclined to make it a warning if it doesn't match rather
than a fatal. (I'd also have to verify that the checksum is properly
set on the page prior to copying the FPI into WAL, which I'm pretty
sure it is but not certain.)

> > > case 'W':
> > > config.save_fpw_path = pg_strdup(optarg);
> > > case 'X':
> > > config.fixup_fpw = true;
> > > config.save_fpw_path = pg_strdup(optarg);
> >
> > Like separate opt processing with their own `break` statement?
> > Probably a bit more readable/conventional.
>
> Yes.

Moot with the removal of the --fixup mode.

> Some more comments:
>
> 1.
> + PGAlignedBlock zerobuf;
> Essentially, it's not a zero buffer, please rename the variable to
> something like 'buf' or 'page_buf' or someother?

Sure.

> 2.
> + if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Replace pg_pwrite with fwrite() and avoid fileno() system calls that
> should suffice here, AFICS, we don't need pg_pwrite.

Sure.

> 3.
> + if (config.save_fpw_path != NULL)
> + {
> + /* Create the dir if it doesn't exist */
> + if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> I think you still need pg_check_dir() here, how about something like below?

I was assuming pg_mkdir_p() acted just like mkdir -p, where it's just
an idempotent action, so an existing dir is just treated the same.
What's the benefit here? Would assume if a non-dir file existed at
that path or other permissions issues arose we'd just get an error
from pg_mkdir_p(). (Will review the code there and confirm.)

> 4.
> + /* Create the dir if it doesn't exist */
> + if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> + {
> + pg_log_error("could not create output directory \"%s\": %m",
> + config.save_fpw_path);
> + goto bad_argument;
> Why is the directory creation error a bad_argument? I think you need
> just pg_fatal() here.

Sure. Was just following the other patterns I'd seen for argument handling.

> 5.
> + fsync(fileno(OPF));
> + fclose(OPF);
> I think just the fsync() isn't enough, you still need fsync_fname()
> and/or fsync_parent_path(), perhaps after for (block_id = 0; block_id
> <= XLogRecMaxBlockId(record); block_id++) loop.

I'm not sure I get the value of the fsyncs here; if you are using this
tool at this capacity you're by definition doing some sort of
transient investigative steps. Since the WAL was fsync'd, you could
always rerun/recreate as needed in the unlikely event of an OS crash
in the middle of this investigation. Since this is outside the
purview of the database operations proper (unlike, say, initdb) seems
like it's unnecessary (or definitely shouldn't need to be selectable).
My thoughts are that if we're going to fsync, just do the fsyncs
unconditionally rather than complicate the interface further.

> 6. Speaking of which, do we need to do fsync()'s optionally? If we
> were to write many such FPI files, aren't there going to be more
> fsync() calls and imagine this feature being used on a production
> server and a lot of fsync() will definitely make running server
> fsync() ops slower. I think we need a new option whether pg_waldump
> ever do fsync() or not, something similar to --no-sync of
> pg_receivewal/pg_upgrade/pg_dump/pg_initdb/pg_checksums etc. I would
> like it if the pg_waldump's --no-sync is coded as 0001 and 0002 can
> make use of it.

See my thoughts on #5; basically I'm -0.5 on the fsyncs at all.

> 7.
> + pg_fatal("couldn't write out complete fullpage image to
> file: %s", filename);
> We typically use "full page image" in the output strings, please correct.

Sure.

> 8.
> +
> + if (((PageHeader) page)->pd_checksum)
> + ((PageHeader) page)->pd_checksum =
> pg_checksum_page((char *) page, blk);
> Why do you need to set the page's checksum by yourself? I don't think
> this is the right way, pg_waldump should just return what it sees in
> the WAL record, of course, after verifying a few checks (like checksum
> is correct or not), but it mustn't set or compute anything new in the
> returned page.

This was in the --fixup codepath only, so will go away.

> Few comments on the tests:
> 1.
> +$primary->init('-k');
> +$primary->append_conf('postgresql.conf', "max_wal_size='100MB'");
> +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> +$primary->start;
> I don't think we need these many append_conf calls here, see how
> others are doing it with just one single call:
>
> $node->append_conf(
> 'postgresql.conf', qq{
> listen_addresses = '$hostaddr'
> krb_server_keyfile = '$keytab'
> log_connections = on
> lc_messages = 'C'
> });

Kk.

> 2.
> +$primary->append_conf('postgresql.conf', "max_wal_size='100MB'");
> Do you really need 100MB max_wal_size? Why can't you just initialize
> your cluster with 1MB wal files instead of 16MB and set max_wal_size
> to 4MB or a bit more, something like 019_replslot_limit.pl does?

Yeah, that could work; I wasn't aware that other tests were modifying
those params. In order to get a test that wouldn't barf when it hit
the end of the WAL stream (and so fail the test) I needed to ensure
there were multiple WAL files generated that would not be recycled so
I could dump a single WAL file without error. This was the approach I
was able to come up with. :-)

> 3.
> +# generate data/wal to examine
> +$primary->safe_psql('postgres', q(CREATE DATABASE db1));
> +$primary->safe_psql('db1', <<EOF);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT;
> +SELECT pg_switch_wal();
> +UPDATE test_table SET a = a + 1;
> +CHECKPOINT;
> +SELECT pg_switch_wal();
> +UPDATE test_table SET a = a + 1;
> +CHECKPOINT;
> +SELECT pg_switch_wal();
> +EOF
> I don't think you need these many complex things to generate WAL,
> multiple CHECKPOINT;s can make tests slower. To keep it simple, you
> can just create a table, insert a single row, checkpoint, update the
> row, switch the wal - no need to test if your feature generates
> multiple WAL files, it's enough to test if it generates just one.
> Please simplfiy the tests.

Can probably simplify more, but see the rationale on my last point.

> 4.
> +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> Why do you need to set wal_level to replica, out of the box your
> cluster comes with replica only no?
> And why do you need archive_mode on and set the command to do nothing?
> Why archiving is needed for testing your feature firstly?

I think it had shown "minimal" in my testing; I was purposefully
failing archives so the WAL would stick around. Maybe a custom
archive command that just copied a single WAL file into a known
location so I could use that instead of the current approach would
work, though not sure how Windows support would work with that. Open
to other ideas to more cleanly get a single WAL file that isn't the
last one. (Earlier versions of this test were using /all/ of the
generated WAL files rather than a single one, so maybe I am
overcomplicating things for a single WAL file case.)

> 5.
> +my $primary = PostgreSQL::Test::Cluster->new('primary');
> Can you rename your node to other than primary? Because this isn't a
> test of replication where primary and standby nodes get created. How
> about just 'node'?

Sure, np.

Best,

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-11-10 17:37:38 Re: Ability to reference other extensions by schema in extension scripts
Previous Message Aleksander Alekseev 2022-11-10 15:40:44 Re: Unit tests for SLRU