Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Ashwin Agrawal <ashwinstar(at)gmail(dot)com>
To: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
Date: 2022-01-03 18:55:06
Message-ID: CAKSySwfaXPtmGiJ_m9tmVGpuK9-VQ3T_j=wLuKd-tuo=UCCSnA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 22, 2021 at 4:23 PM SATYANARAYANA NARLAPURAM <
satyanarlapuram(at)gmail(dot)com> wrote:

> Hi Hackers,
>
> I am considering implementing RPO (recovery point objective) enforcement
> feature for Postgres where the WAL writes on the primary are stalled when
> the WAL distance between the primary and standby exceeds the configured
> (replica_lag_in_bytes) threshold. This feature is useful particularly in
> the disaster recovery setups where primary and standby are in different
> regions and synchronous replication can't be set up for latency and
> performance reasons yet requires some level of RPO enforcement.
>
> The idea here is to calculate the lag between the primary and the standby
> (Async?) server during XLogInsert and block the caller until the lag is
> less than the threshold value. We can calculate the max lag by iterating
> over ReplicationSlotCtl->replication_slots. If this is not something we
> don't want to do in the core, at least adding a hook for XlogInsert is of
> great value.
>
> A few other scenarios I can think of with the hook are:
>
> 1. Enforcing RPO as described above
> 2. Enforcing rate limit and slow throttling when sync standby is
> falling behind (could be flush lag or replay lag)
> 3. Transactional log rate governance - useful for cloud providers to
> provide SKU sizes based on allowed WAL writes.
>
> Thoughts?
>

A very similar need was discussed in the past in [1]. That thread was not
exactly about RPO enforcement, but about large bulk operations/transactions
negatively impacting concurrent transactions due to replication lag.
It would be good to refer to that thread, as it explains the challenges of
implementing the functionality mentioned here. The main challenge is that
there is no common place to put the throttling logic, so calls would have
to be sprinkled around in various parts of the code.
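
As a rough illustration of the lag check described in the quoted proposal
(a standalone sketch, not actual PostgreSQL code): SlotSketch, WalPosition,
max_replica_lag() and should_throttle() are hypothetical names made up for
this example. Only the replica_lag_in_bytes threshold comes from the
proposal. Real code would have to read
ReplicationSlotCtl->replication_slots under the appropriate locks and decide
where a blocked caller can safely wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a WAL location: a 64-bit byte position (assumption). */
typedef uint64_t WalPosition;

/* Hypothetical, simplified view of one replication slot. */
typedef struct SlotSketch
{
    bool        in_use;         /* is this slot entry occupied? */
    WalPosition flushed_upto;   /* WAL position the standby has confirmed */
} SlotSketch;

/* Return the largest lag in bytes over all in-use slots, or 0 if none lag. */
static uint64_t
max_replica_lag(WalPosition insert_pos, const SlotSketch *slots, size_t nslots)
{
    uint64_t    max_lag = 0;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use)
            continue;

        /* A standby at or past the insert position has no lag. */
        if (slots[i].flushed_upto >= insert_pos)
            continue;

        uint64_t    lag = insert_pos - slots[i].flushed_upto;

        if (lag > max_lag)
            max_lag = lag;
    }
    return max_lag;
}

/* True when the worst lag exceeds the configured threshold. */
static bool
should_throttle(WalPosition insert_pos, const SlotSketch *slots,
                size_t nslots, uint64_t replica_lag_in_bytes)
{
    return max_replica_lag(insert_pos, slots, nslots) > replica_lag_in_bytes;
}
```

A WAL-inserting caller would evaluate should_throttle() and wait while it
returns true. The sketch says nothing about where such a call can be placed,
which is exactly the difficulty mentioned above.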

[1]
https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
