Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-03-19 05:26:25
Message-ID: CAA4eK1KhMr5gEvzAutxkS--n6y16-mrMxF1fiq-d-RSo_URE5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> On Thu, Mar 14, 2024 at 12:27:26PM +0530, Amit Kapila wrote:
> > On Wed, Mar 13, 2024 at 10:16 PM Bharath Rupireddy
> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Mar 13, 2024 at 11:13 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > > >
> > > > > Thanks. v8-0001 is how it looks. Please see the v8 patch set with this change.
> > > >
> > > > JFYI, the patch does not apply to the head. There is a conflict in
> > > > multiple files.
> > >
> > > Thanks for looking into this. I noticed that the v8 patches needed
> > > rebase. Before I go do anything with the patches, I'm trying to gain
> > > consensus on the design. Following is the summary of design choices
> > > we've discussed so far:
> > > 1) conflict_reason vs invalidation_reason.
> > > 2) When to compute the XID age?
> > >
> >
> > I feel we should focus on two things (a) one is to introduce a new
> > column invalidation_reason, and (b) let's try to first complete
> > invalidation due to timeout. We can look into XID stuff if time
> > permits, remember, we don't have ample time left.
>
> Agree. While it makes sense to invalidate slots for wal removal in
> CreateCheckPoint() (because this is the place where wal is removed), I 'm not
> sure this is the right place for the 2 new cases.
>
> Let's focus on the timeout one as proposed above (as probably the simplest one):
> as this one is purely related to time and activity what about to invalidate them
> when?:
>
> - their usage resume
> - in pg_get_replication_slots()
>
> The idea is to invalidate the slot when one resumes activity on it or wants to
> get information about it (and among other things wants to know if the slot is
> valid or not).
>

Trying to invalidate at those two places makes sense to me but we
still need to cover the cases where it takes very long to resume the
slot activity and the dangling slot cases where the activity is never
resumed. How about apart from the above two places, trying to
invalidate in CheckPointReplicationSlots() where we are traversing all
the slots? This could prevent invalid slots from being marked as
dirty.

BTW, how will the user use 'inactive_count' to know whether a
replication slot is becoming inactive too frequently? The patch just
keeps incrementing this counter, one will never know in the last 'n'
minutes, how many times the slot became inactive unless there is some
monitoring tool that keeps capturing this counter from time to time
and calculates the frequency in some way. Even, if this is useful, it
is not clear to me whether we need to store 'inactive_count' in the
slot's persistent data. I understand it could be a metric required by
the user but wouldn't it be better to track this via
pg_stat_replication_slots such that we don't need to store this in
slot's persist data? If this understanding is correct, I would say
let's remove 'inactive_count' as well from the main patch and discuss
it separately.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2024-03-19 06:12:34 Re: Support run-time partition pruning for hash join
Previous Message Andrei Lepikhov 2024-03-19 05:16:59 Re: POC, WIP: OR-clause support for indexes