Re: pg_receivewal starting position

From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, bharath(dot)rupireddyforpostgres(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: pg_receivewal starting position
Date: 2021-10-28 13:55:12
Message-ID: 3489026.hdfAi7Kttb@aivenronan
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Le jeudi 28 octobre 2021, 14:31:36 CEST Michael Paquier a écrit :
> On Wed, Oct 27, 2021 at 10:11:00AM +0200, Ronan Dunklau wrote:
> > Sorry I sent an intermediary version of the patch, here is the correct
> > one.
>
> While looking at this patch, I have figured out a simple way to make
> the tests faster without impacting the coverage. The size of the WAL
> segments archived is a bit of a bottleneck as they need to be zeroed
> by pg_receivewal at creation. This finishes by being a waste of time,
> even if we don't flush the data. So I'd like to switch the test so as
> we use a segment size of 1MB, first.
>
> A second thing is that we copy too many segments than really needed
> when using the slot's restart_lsn as starting point as RESERVE_WAL
> would use the current redo location, so it seems to me that a
> checkpoint is in order before the slot creation. A third thing is
> that generating some extra data after the end LSN we want to use makes
> the test much faster at the end.
>
> With those three methods combined, the test goes down from 11s to 9s
> here. Attached is a patch I'd like to apply to make the test
> cheaper.

Interesting ideas, thanks. For the record, the time drops from ~4.5s to 3s on
average on my machine.
I think if you reduce the size of the generate_series batches, this should
probably be reduced everywhere. With what we do though, inserting a single
line should work just as well, I wonder why we insist on inserting a hundred
lines ? I updated your patch with that small modification, it also makes the
code less verbose.

>
> I also had a look at your patch. Here are some comments.
>
> +# Cleanup the previous stream directories to reuse them
> +unlink glob "'${stream_dir}/*'";
> +unlink glob "'${slot_dir}/*'";
> I think that we'd better choose a different location for the
> archives. Keeping the segments of the previous tests is useful for
> debugging if a previous step of the test failed.

Ok.
>
> +$standby->psql('',
> + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)",
> + replication => 1);
> Here as well we could use a restart point to reduce the number of
> segments archived.

The restart point should be very close, as we don't generate any activity on
the primary between the backup and the slot's creation. I'm not sure adding
the complexity of triggering a checkpoint on the primary and waiting for the
standby to catch up on it would be that useful.
>
> +# Now, try to resume after the promotion, from the folder.
> +$standby->command_ok(
> + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn,
> + '--slot', $folder_slot, '--no-sync'],
> + "Stream some wal after promoting, resuming from the folder's position");
> What is called the "resume-from-folder" test in the patch is IMO
> too costly and brings little extra value, requiring two commands of
> pg_receivewal (one to populate the folder and one to test the TLI jump
> from the previously-populated point) to test basically the same thing
> as when the starting point is taken from the slot. Except that
> restarting from the slot is twice cheaper. The point of the
> resume-from-folder case is to make sure that we restart from the point
> of the archive folder rather than the slot's restart_lsn, but your
> test fails this part, in fact, because the first command populating
> the archive folder also uses "--slot $folder_slot", updating the
> slot's restart_lsn before the second pg_receivewal uses this slot
> again.
>
> I think, as a whole, that testing for the case where an archive folder
> is populated (without a slot!), followed by a second command where we
> use a slot that has a restart_lsn older than the archive's location,
> to not be that interesting. If we include such a test, there is no
> need to include that within the TLI jump part, in my opinion. So I
> think that we had better drop this part of the patch, and keep only
> the case where we resume from a slot for the TLI jump.

You're right about the test not being that interesting in that case. I thought
it would be worthwhile to test both cases, but resume_from_folder doesn't
actually exercise any code that was not already called before (timeline
computation from segment name).

> The commands of pg_receivewal included in the test had better use -n
> so as there is no infinite loop on failure.

Ok.

--
Ronan Dunklau

Attachment Content-Type Size
v16-0002-Speed-up-pg_receivewal-tests.patch text/x-patch 3.3 KB
v16-0001-Add-a-test-for-pg_receivewal-following-timeline-.patch text/x-patch 3.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message osumi.takamichi@fujitsu.com 2021-10-28 14:18:47 RE: Failed transaction statistics to measure the logical replication progress
Previous Message Arne Roland 2021-10-28 13:44:31 Re: missing indexes in indexlist with partitioned tables