Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-03-19 09:41:10
Message-ID: ZfldthofFLdPibEM@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Mar 19, 2024 at 10:56:25AM +0530, Amit Kapila wrote:
> On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > Agree. While it makes sense to invalidate slots for wal removal in
> > CreateCheckPoint() (because this is the place where wal is removed), I 'm not
> > sure this is the right place for the 2 new cases.
> >
> > Let's focus on the timeout one as proposed above (as probably the simplest one):
> > as this one is purely related to time and activity what about to invalidate them
> > when?:
> >
> > - their usage resume
> > - in pg_get_replication_slots()
> >
> > The idea is to invalidate the slot when one resumes activity on it or wants to
> > get information about it (and among other things wants to know if the slot is
> > valid or not).
> >
>
> Trying to invalidate at those two places makes sense to me but we
> still need to cover the cases where it takes very long to resume the
> slot activity and the dangling slot cases where the activity is never
> resumed.

I understand it's better to have the slot reflecting its real status internally
but it is a real issue if that's not the case until the activity on it is resumed?
(just asking, not saying we should not)

> How about apart from the above two places, trying to
> invalidate in CheckPointReplicationSlots() where we are traversing all
> the slots?

I think that's a good place but there is still a window of time (that could also
be "large" depending of the activity and the checkpoint frequency) during which
the slot is not known as invalid internally. But yeah, at leat we know that we'll
mark it as invalid at some point...

BTW:

if (am_walsender)
{
+ if (slot->data.persistency == RS_PERSISTENT)
+ {
+ SpinLockAcquire(&slot->mutex);
+ slot->data.last_inactive_at = GetCurrentTimestamp();
+ slot->data.inactive_count++;
+ SpinLockRelease(&slot->mutex);

I'm also feeling the same concern as Shveta mentioned in [1]: that a "normal"
backend using pg_logical_slot_get_changes() or friends would not set the
last_inactive_at.

[1]: https://www.postgresql.org/message-id/CAJpy0uD64X%3D2ENmbHaRiWTKeQawr-rbGoy_GdhQQLVXzUSKTMg%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-03-19 09:50:23 Re: Inconsistent printf placeholders
Previous Message John Naylor 2024-03-19 09:40:06 Re: [PoC] Improve dead tuple storage for lazy vacuum