Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-04-02 06:28:40
Message-ID: ZgulmKzdqxuBuhCU@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
>
> FWIW, coming to this thread late, I think that the inactive_since
> should not be synchronized from the primary. The wall clocks are
> different on the primary and the standby so having the primary's
> timestamp on the standby can confuse users, especially when there is a
> big clock drift. Also, as Amit mentioned, inactive_since seems not to
> be consistent with other parameters we copy. The
> replication_slot_inactive_timeout feature should work on the standby
> independent from the primary, like other slot invalidation mechanisms,
> and it should be based on its own local clock.

Thanks for sharing your thoughts! So, it looks like that most of us agree to not
sync inactive_since from the primary, I'm fine with that.

> If we want to invalidate the synced slots due to the timeout, I think
> we need to define what is "inactive" for synced slots.
>
> Suppose that the slotsync worker updates the local (synced) slot's
> inactive_since whenever releasing the slot, irrespective of the actual
> LSNs (or other slot parameters) having been updated. I think that this
> idea cannot handle a slot that is not acquired on the primary. In this
> case, the remote slot is inactive but the local slot is regarded as
> active. WAL files are piled up on the standby (and on the primary) as
> the slot's LSNs don't move forward. I think we want to regard such a
> slot as "inactive" also on the standby and invalidate it because of
> the timeout.

I think that makes sense to somehow link inactive_since on the standby to
the actual LSNs (or other slot parameters) being updated or not.

> > > Now, the other concern is that calling GetCurrentTimestamp()
> > > could be costly when the values for the slot are not going to be
> > > updated but if that happens we can optimize such that before acquiring
> > > the slot we can have some minimal pre-checks to ensure whether we need
> > > to update the slot or not.
>
> If we use such pre-checks, another problem might happen; it cannot
> handle a case where the slot is acquired on the primary but its LSNs
> don't move forward. Imagine a logical replication conflict happened on
> the subscriber, and the logical replication enters the retry loop. In
> this case, the remote slot's inactive_since gets updated for every
> retry, but it looks inactive from the standby since the slot LSNs
> don't change. Therefore, only the local slot could be invalidated due
> to the timeout but probably we don't want to regard such a slot as
> "inactive".
>
> Another idea I came up with is that the slotsync worker updates the
> local slot's inactive_since to the local timestamp only when the
> remote slot might have got inactive. If the remote slot is acquired by
> someone, the local slot's inactive_since is also NULL. If the remote
> slot gets inactive, the slotsync worker sets the local timestamp to
> the local slot's inactive_since. Since the remote slot could be
> acquired and released before the slotsync worker gets the remote slot
> data again, if the remote slot's inactive_since > the local slot's
> inactive_since, the slotsync worker updates the local one.

Then I think we would need to be careful about time zone comparison.

> IOW, we
> detect whether the remote slot was acquired and released since the
> last synchronization, by checking the remote slot's inactive_since.
> This idea seems to handle these cases I mentioned unless I'm missing
> something, but it requires for the slotsync worker to update
> inactive_since in a different way than other parameters.
>
> Or a simple solution is that the slotsync worker updates
> inactive_since as it does for non-synced slots, and disables
> timeout-based slot invalidation for synced slots.

Yeah, I think the main question to help us decide is: do we want to invalidate
"inactive" synced slots locally (in addition to synchronizing the invalidation
from the primary)?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2024-04-02 06:46:16 Re: Schema variables - new implementation for Postgres 15
Previous Message Jeff Davis 2024-04-02 05:19:11 Re: Table AM Interface Enhancements