Re: Minimal logical decoding on standbys

From: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, fabriziomello(at)gmail(dot)com, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2023-01-11 15:23:34
Message-ID: d7547f2c-a0c3-6aad-b631-b7ed5efaf298@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 1/11/23 8:32 AM, Bharath Rupireddy wrote:
> On Tue, Jan 10, 2023 at 2:03 PM Drouvot, Bertrand
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>>
>> Please find attached, V37 taking care of:
>
> Thanks. I started to digest the design specified in the commit message
> and these patches.

Thanks for looking at it!

> Here are some quick comments:
>
> 1. Does logical decoding on standby work without any issues if the
> standby is set for cascading replication?
>

Without "any issues" is hard to guarantee ;-) But according to my tests:

Primary -> Standby1 with or without logical replication slot -> Standby2 with or without logical replication slot

works as expected (and also with cascading promotion).
We can add some TAP tests in 0004 though.

> 2. Do logical decoding output plugins work without any issues on the
> standby with decoding enabled, say, when the slot is invalidated?
>

Not sure, I got the question.
If the slot is invalidated then it's expected to get errors like:

pg_recvlogical: error: unexpected termination of replication stream: ERROR: canceling statement due to conflict with recovery
DETAIL: User was using the logical slot that must be dropped.

or

pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "bdt_slot" LOGICAL 0/0": ERROR: cannot read from logical replication slot "bdt_slot"
DETAIL: This slot has been invalidated because it was conflicting with recovery.

> 3. Is this feature still a 'minimal logical decoding on standby'?
> Firstly, why is it 'minimal'?
>

Good question and I don't have the answer.
That's how it has been named when this thread started back in 2018.

> 4. What happens in case of failover to the standby that's already
> decoding for its clients? Will the decoding work seamlessly? If not,
> what are recommended things that users need to take care of
> during/after failovers?

Yes, it's expected to work seamlessly. There is a TAP test in
0004 for this scenario.

>
> 0002:
> 1.
> - if (InvalidateObsoleteReplicationSlots(_logSegNo))
> + InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo,
> &invalidated, InvalidOid, NULL);
>
> Isn't the function name too long and verbose? How about just
> InvalidateLogicalReplicationSlots()

The function also takes care of Invalidation of Physical replication slots
that are Obsolete (aka LSN case).

InvalidateObsoleteOrConflictingReplicationSlots() maybe?

> let the function comment talk
> about what sorts of replication slots it invalides?

Agree to make the comment more clear.

>
> 2.
> + errdetail("Logical decoding on
> standby requires wal_level to be at least logical on master"));
> + * master wal_level is set back to replica, so existing logical
> slots need to
> invalidate such slots. Also do the same thing if wal_level on master
>
> Can we use 'primary server' instead of 'master' like elsewhere? This
> comment also applies for other patches too, if any.
>

Sure.

> 3. Can we show a new status in pg_get_replication_slots's wal_status
> for invalidated due to the conflict so that the user can monitor for
> the new status and take necessary actions?
>

Not sure you've seen but the patch series is adding a new field (confl_active_logicalslot) in
pg_stat_database_conflicts.

That said, I like your idea about adding a new status in pg_replication_slots too.
Do you think it's mandatory for this patch series? (I mean it could be added once this patch series is committed).

I'm asking because this patch series looks already like a "big" one, is more than 4 years old
and I'm afraid of adding more "reporting" stuff to it (unless we feel a strong need for it of course).

> 4. How will the user be notified when logical replication slots are
> invalidated due to conflict with the primary server?

Emitting messages, like the ones mentioned above introduced in 0002.

> How can they
> (firstly, is there a way?) repair/restore such replication slots? Or
> is recreating/reestablishing logical replication only the way out for
> them?

Drop/recreate is what is part of the current design and discussed up-thread IIRC.

> What users can do to avoid their logical replication slots
> getting invalidated and run into these situations? Because
> recreating/reestablishing logical replication comes with cost
> (sometimes huge) as it involves building another instance, syncing
> tables etc. Isn't it a good idea to touch up on all these aspects in
> the documentation at least as to why we're doing this and why we can't
> do this?
>

0005 adds a few words about it:

+ <para>
+ A logical replication slot can also be created on a hot standby. To prevent
+ <command>VACUUM</command> from removing required rows from the system
+ catalogs, <varname>hot_standby_feedback</varname> should be set on the
+ standby. In spite of that, if any required rows get removed, the slot gets
+ invalidated. It's highly recommended to use a physical slot between the primary
+ and the standby. Otherwise, hot_standby_feedback will work, but only while the
+ connection is alive (for example a node restart would break it). Existing
+ logical slots on standby also get invalidated if wal_level on primary is reduced to
+ less than 'logical'.
+ </para>

> 5.
> @@ -1253,6 +1253,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)
>
> ReplicationSlotAcquire(cmd->slotname, true);
>
> + if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
> + && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot read from logical replication slot \"%s\"",
> + cmd->slotname),
> + errdetail("This slot has been invalidated because it
> was conflicting with recovery.")));
>
> Having the invalidation check in StartLogicalReplication() looks fine,
> however, what happens if the slot gets invalidated when the
> replication is in-progress? Do we need to error out in WalSndLoop() or
> XLogSendLogical() too? Or is it already taken care of somewhere?
>

Yes, it's already taken care in pg_logical_slot_get_changes_guts().

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-01-11 15:32:57 What object types should be in schemas?
Previous Message Noah Misch 2023-01-11 15:16:55 Re: allowing for control over SET ROLE