Re: Switching XLog source from archive to streaming when primary available

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Switching XLog source from archive to streaming when primary available
Date: 2022-05-24 16:18:05
Message-ID: CALj2ACUk3Wc53Xy4HcivexXZw0DXGaGbbRznuK+cdePHdDLRRA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 30, 2022 at 6:19 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
> <satyanarlapuram(at)gmail(dot)com> wrote:
> >
> > Hi Hackers,
> >
> > When the standby cannot connect to the primary, it switches the XLog source from streaming to archive and stays in that state for as long as it can get WAL from the archive location. On a server with high WAL activity, getting WAL from the archive is typically slower than streaming it from the primary, so the standby may never exit that state. This not only increases the lag on the standby but also adversely impacts the primary, because WAL accumulates and vacuum cannot collect dead tuples. As a mitigation, DBAs can remove or advance the slot, or remove the restore_command on the standby, but this is manual work that I am trying to avoid. I would like to propose the following; please let me know your thoughts.
> >
> > Automatically attempt to switch the source from archive to streaming, when primary_conninfo is set, after replaying 'N' WAL segments, where N is governed by the GUC retry_primary_conn_after_wal_segments.
> > When retry_primary_conn_after_wal_segments is set to -1, the feature is disabled.
> > When the retry attempt fails, switch back to the archive.
>
> I've gone through the state machine in WaitForWALToBecomeAvailable and
> I understand it this way: a failure to receive WAL records from the
> primary causes the current source to switch to archive, and the standby
> then continues to get WAL records from the archive location. Unless some
> failure occurs there, the current source never switches back to
> stream. Given that getting WAL from the archive location causes
> delay in production environments, we miss the chance to take advantage
> of reconnecting to the primary after a previous failed attempt.
>
> So basically, we attempt to switch from archive to streaming
> (even though fetching from archive can still succeed) after a certain
> amount of time or number of WAL segments. I prefer a timing-based
> switch to streaming over a switch after a number of WAL segments
> fetched from archive. Right now, wal_retrieve_retry_interval is used to
> wait before switching to archive after a failed streaming attempt. IMO,
> a similar GUC (a timer that starts once the source switches from
> streaming to archive and, on timeout, switches back to streaming) can
> be used to switch from archive to streaming after the specified amount
> of time.
>
> Thoughts?

Here's a v1 patch that I've come up with. For now it uses the
existing GUC wal_retrieve_retry_interval to switch from archive mode to
stream mode, as opposed to switching only after a failure to get WAL
from the archive. If the approach looks okay, I can add tests, update
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.
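
To make the idea concrete, here is a minimal standalone sketch of the
timing-based switch described above. It is not the patch and not the
code in WaitForWALToBecomeAvailable; the type and function names
(WalSource, SourceState, on_stream_failure, choose_source) are
hypothetical, and retry_interval_ms only stands in for
wal_retrieve_retry_interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the PostgreSQL source. */
typedef enum { SRC_ARCHIVE, SRC_STREAM } WalSource;

typedef struct
{
    WalSource source;                    /* source currently in use */
    int64_t   switched_to_archive_at_ms; /* time of last fallback to archive */
    int64_t   retry_interval_ms;         /* stand-in for wal_retrieve_retry_interval */
} SourceState;

/* Streaming failed: fall back to archive and remember when that happened. */
static void
on_stream_failure(SourceState *st, int64_t now_ms)
{
    st->source = SRC_ARCHIVE;
    st->switched_to_archive_at_ms = now_ms;
}

/*
 * Decide which source to try next.  While in archive mode, switch to
 * streaming either when the archive fetch failed (the existing trigger)
 * or when the interval has elapsed (the proposed trigger), even if the
 * archive is still delivering WAL.
 */
static WalSource
choose_source(SourceState *st, int64_t now_ms, bool archive_ok)
{
    assert(st->retry_interval_ms >= 0);

    if (st->source == SRC_ARCHIVE)
    {
        bool timed_out =
            now_ms - st->switched_to_archive_at_ms >= st->retry_interval_ms;

        if (!archive_ok || timed_out)
            st->source = SRC_STREAM;
    }
    return st->source;
}
```

If the streaming attempt fails again, the caller would invoke
on_stream_failure() once more, which restarts the timer, so the standby
keeps replaying from the archive between attempts.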

Regards,
Bharath Rupireddy.

Attachment Content-Type Size
v1-0001-Switch-to-stream-mode-from-archive-occasionally.patch application/octet-stream 4.8 KB
