Re: Introduce XID age and inactive timeout based replication slot invalidation

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-04-03 03:08:10
Message-ID: CAJpy0uBthb-RG=Xg+mM-G6Xy01JLcBsizJuz+xbti08qP_W7AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 2, 2024 at 11:58 AM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> > On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
> >
> > FWIW, coming to this thread late, I think that the inactive_since
> > should not be synchronized from the primary. The wall clocks are
> > different on the primary and the standby so having the primary's
> > timestamp on the standby can confuse users, especially when there is a
> > big clock drift. Also, as Amit mentioned, inactive_since seems not to
> > be consistent with other parameters we copy. The
> > replication_slot_inactive_timeout feature should work on the standby
> > independent from the primary, like other slot invalidation mechanisms,
> > and it should be based on its own local clock.
>
> Thanks for sharing your thoughts! So, it looks like that most of us agree to not
> sync inactive_since from the primary, I'm fine with that.

+1 on not syncing slots from primary.

> > If we want to invalidate the synced slots due to the timeout, I think
> > we need to define what is "inactive" for synced slots.
> >
> > Suppose that the slotsync worker updates the local (synced) slot's
> > inactive_since whenever releasing the slot, irrespective of the actual
> > LSNs (or other slot parameters) having been updated. I think that this
> > idea cannot handle a slot that is not acquired on the primary. In this
> > case, the remote slot is inactive but the local slot is regarded as
> > active. WAL files are piled up on the standby (and on the primary) as
> > the slot's LSNs don't move forward. I think we want to regard such a
> > slot as "inactive" also on the standby and invalidate it because of
> > the timeout.
>
> I think that makes sense to somehow link inactive_since on the standby to
> the actual LSNs (or other slot parameters) being updated or not.
>
> > > > Now, the other concern is that calling GetCurrentTimestamp()
> > > > could be costly when the values for the slot are not going to be
> > > > updated but if that happens we can optimize such that before acquiring
> > > > the slot we can have some minimal pre-checks to ensure whether we need
> > > > to update the slot or not.
> >
> > If we use such pre-checks, another problem might happen; it cannot
> > handle a case where the slot is acquired on the primary but its LSNs
> > don't move forward. Imagine a logical replication conflict happened on
> > the subscriber, and the logical replication enters the retry loop. In
> > this case, the remote slot's inactive_since gets updated for every
> > retry, but it looks inactive from the standby since the slot LSNs
> > don't change. Therefore, only the local slot could be invalidated due
> > to the timeout but probably we don't want to regard such a slot as
> > "inactive".
> >
> > Another idea I came up with is that the slotsync worker updates the
> > local slot's inactive_since to the local timestamp only when the
> > remote slot might have got inactive. If the remote slot is acquired by
> > someone, the local slot's inactive_since is also NULL. If the remote
> > slot gets inactive, the slotsync worker sets the local timestamp to
> > the local slot's inactive_since. Since the remote slot could be
> > acquired and released before the slotsync worker gets the remote slot
> > data again, if the remote slot's inactive_since > the local slot's
> > inactive_since, the slotsync worker updates the local one.
>
> Then I think we would need to be careful about time zone comparison.

Yes. Also we need to consider the case when a user is relying on
pg_sync_replication_slots() and has not enabled slot-sync worker. In
such a case if synced slot's inactive_since is derived from inactivity
of remote-slot, it might not be that frequently updated (based on when
the user actually runs the SQL function) and thus may be misleading.
OTOH, if inactivty_since of synced slots represents its own
inactivity, then it will give correct info even for the case when the
SQL function is run after a long time and slot-sync worker is
disabled.

> > IOW, we
> > detect whether the remote slot was acquired and released since the
> > last synchronization, by checking the remote slot's inactive_since.
> > This idea seems to handle these cases I mentioned unless I'm missing
> > something, but it requires for the slotsync worker to update
> > inactive_since in a different way than other parameters.
> >
> > Or a simple solution is that the slotsync worker updates
> > inactive_since as it does for non-synced slots, and disables
> > timeout-based slot invalidation for synced slots.

I like this idea better, it takes care of such a case too when the
user is relying on sync-function rather than worker and does not want
to get the slots invalidated in between 2 sync function calls.

> Yeah, I think the main question to help us decide is: do we want to invalidate
> "inactive" synced slots locally (in addition to synchronizing the invalidation
> from the primary)?

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-04-03 03:30:08 Re: remaining sql/json patches
Previous Message Nathan Bossart 2024-04-03 02:09:14 Re: Popcount optimization using AVX512