Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
Date: 2022-04-09 09:08:50
Message-ID: CALj2ACWd2fds-LagF=VfSgr9fQwTaByV40urNZjhpqvaa1F6dQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram(at)gmail(dot)com> wrote:
>
>> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I'm wondering whether there's a way in core postgres to achieve
>> > > $subject. In reality, sync/async standbys can be closer to or
>> > > farther from the primary (which means sync/async standbys can
>> > > receive WAL at different times), especially in cloud HA
>> > > environments with the primary in one Availability Zone
>> > > (AZ)/Region and standbys in different AZs/Regions. $subject may
>> > > not be possible on dev systems (say, for testing some HA
>> > > features) unless we can inject a delay in WAL senders before
>> > > sending WAL.
>
> Simulation will be helpful even for end customers, to simulate faults in production environments during availability zone/disaster recovery drills.

Right.

>> > > How about having two developer-only GUCs, {async,
>> > > sync}_wal_sender_delay? When set, the async and sync WAL
>> > > senders would delay sending WAL by {async,
>> > > sync}_wal_sender_delay milliseconds/seconds. Although I can't
>> > > think of any immediate use, it will be useful someday IMO, say
>> > > for features like [1], if it gets in. With this set of GUCs,
>> > > one can even add core regression tests for HA features.
>
> I would suggest doing this at the slot level, instead of two GUCs that control the behavior of all the slots (physical/logical). Something like "pg_suspend_replication_slot" and "pg_resume_replication_slot"?

Having the control at the replication slot level seems reasonable
instead of at the WAL sender level. Since there can be many slots on
the primary, we need a way to specify which slots should be delayed
and by how much before sending WAL. If we used GUCs, they would have
to be list-valued, and I'm not sure that would come out well.

Instead, how about two functions (superuser-only, or callable by
users with the replication role), say
pg_replication_slot_set_delay(slot_name, delay_in_milliseconds) and
pg_replication_slot_unset_delay(slot_name)?
pg_replication_slot_set_delay would set ReplicationSlot->delay, and
the WAL sender would check whether MyReplicationSlot->delay > 0 and
wait before sending WAL. pg_replication_slot_unset_delay would reset
ReplicationSlot->delay to 0; alternatively,
pg_replication_slot_set_delay(slot_name, 0) could serve the same
purpose, so a single function would suffice.
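
To make the idea concrete, here's a rough sketch of the WAL sender
side. The helper name, the new ReplicationSlot->delay field, and the
call site are all hypothetical, not existing code:

#include "postgres.h"

#include "replication/slot.h"
#include "storage/spin.h"

/*
 * Hypothetical helper for walsender.c, called just before sending a
 * batch of WAL.  "delay" is an assumed new field in ReplicationSlot
 * (set by the proposed pg_replication_slot_set_delay()), protected
 * by the slot's existing spinlock.
 */
static void
WalSndDelayIfRequested(void)
{
	int			delay_ms;

	if (MyReplicationSlot == NULL)
		return;

	SpinLockAcquire(&MyReplicationSlot->mutex);
	delay_ms = MyReplicationSlot->delay;	/* assumed new field */
	SpinLockRelease(&MyReplicationSlot->mutex);

	/* Simulate network distance by sleeping before sending WAL. */
	if (delay_ms > 0)
		pg_usleep((long) delay_ms * 1000L);
}

A real implementation would presumably wait on the process latch
(WaitLatch) instead of pg_usleep, so the sleep stays interruptible
by shutdown requests and config reloads.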

If users want a standby to receive WAL with a delay, they can call
pg_replication_slot_set_delay after creating the replication slot.

Thoughts?

> Alternatively, a GUC on the standby side instead of the primary, so that the WAL receiver stops responding to the WAL sender?

I think the existing wal_receiver_status_interval GUC on the WAL
receiver side already achieves the above, i.e. not responding to the
primary at all; one can set wal_receiver_status_interval to, say, 1
day. Its definition is at [1].

[1]
{
	{"wal_receiver_status_interval", PGC_SIGHUP, REPLICATION_STANDBY,
		gettext_noop("Sets the maximum interval between WAL receiver status reports to the sending server."),
		NULL,
		GUC_UNIT_S
	},
	&wal_receiver_status_interval,
	10, 0, INT_MAX / 1000,
	NULL, NULL, NULL
},
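
For example, on the standby (the parameter is PGC_SIGHUP, so a
config reload via SELECT pg_reload_conf() is enough; time units such
as 'd' are accepted since the GUC is declared with GUC_UNIT_S):

wal_receiver_status_interval = 1d	# in postgresql.conf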

Regards,
Bharath Rupireddy.
