Re: Switching XLog source from archive to streaming when primary available

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Cary Huang <cary(dot)huang(at)highgo(dot)ca>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Subject: Re: Switching XLog source from archive to streaming when primary available
Date: 2022-08-11 15:38:23
Message-ID: CALj2ACUYz1z6QPduGn5gguCkfd-ko44j4hKcOMtp6fzv9xEWgw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jul 8, 2022 at 9:16 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary(dot)huang(at)highgo(dot)ca> wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, passed
> > Implements feature: tested, passed
> > Spec compliant: not tested
> > Documentation: not tested
> >
> > Hello
> >
> > I tested this patch in a setup where the standby is in the middle of replicating and REDOing primary's WAL files during a very large data insertion. During this time, I keep killing the walreceiver process to cause a stream failure and force standby to read from archive. The system will restore from archive for "wal_retrieve_retry_interval" seconds before it attempts to steam again. Without this patch, once the streaming is interrupted, it keeps reading from archive until standby reaches the same consistent state of primary and then it will switch back to streaming again. So it seems that the patch does the job as described and does bring some benefit during a very large REDO job where it will try to re-stream after restoring some WALs from archive to speed up this "catch up" process. But if the recovery job is not a large one, PG is already switching back to streaming once it hits consistent state.
>
> Thanks a lot Cary for testing the patch.
>
> > Here's a v1 patch that I've come up with. I'm right now using the
> > existing GUC wal_retrieve_retry_interval to switch to stream mode from
> > archive mode as opposed to switching only after the failure to get WAL
> > from archive mode. If okay with the approach, I can add tests, change
> > the docs and add a new GUC to control this behaviour. I'm open to
> > thoughts and ideas here.
>
> It will be great if I can hear some thoughts on the above points (as
> posted upthread).

Here's the v2 patch with a separate GUC, new GUC was necessary as the
existing GUC wal_retrieve_retry_interval is loaded with multiple
usages. When the feature is enabled, it will let standby to switch to
stream mode i.e. fetch WAL from primary before even fetching from
archive fails. The switching to stream mode from archive happens in 2
scenarios: 1) when standby is in initial recovery 2) when there was a
failure in receiving from primary (walreceiver got killed or crashed
or timed out, or connectivity to primary was broken - for whatever
reasons).

I also added test cases to the v2 patch.

Please review the patch.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment Content-Type Size
v2-0001-Switch-WAL-source-to-stream-from-archive.patch application/octet-stream 14.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-08-11 16:12:18 Use SetInstallXLogFileSegmentActive() for setting XLogCtl->InstallXLogFileSegmentActive
Previous Message Tom Lane 2022-08-11 15:28:41 Re: tests and meson - test names and file locations