Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-03-26 08:57:17
Message-ID: CALj2ACW-QtWfyC7cUXJsRe_YO50fNj5pKFY5FA7BC3LJv4XNEw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 26, 2024 at 11:26 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> Review comments on v18_0002 and v18_0005
> =======================================
>
> 1.
> We have decided to update inactive_since for temporary slots. So,
> unless there is some reason, we should allow inactive_timeout to also
> be set for temporary slots.

WFM. A temporary slot that's inactive for a long time before even the
server isn't shutdown can utilize this inactive_timeout based
invalidation mechanism. And, I'd also vote for we being consistent for
temporary and synced slots.

> L.last_inactive_time,
> + L.inactive_timeout,
>
> Shall we keep inactive_timeout before
> last_inactive_time/inactive_since? I don't have any strong reason to
> propose that way apart from that the former is provided by the user.

Done.

> + if (InvalidateReplicationSlotForInactiveTimeout(slot, false, true, true))
> + invalidated = true;
>
> I don't think we should try to invalidate the slots in
> pg_get_replication_slots. This function's purpose is to get the
> current information on slots and has no intention to perform any work
> for slots. Any error due to invalidation won't be what the user would
> be expecting here.

Agree. Removed.

> 4.
> +static bool
> +InvalidateSlotForInactiveTimeout(ReplicationSlot *slot,
> + bool need_control_lock,
> + bool need_mutex)
> {
> ...
> ...
> + if (need_control_lock)
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +
> + Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
> +
> + /*
> + * Check if the slot needs to be invalidated due to inactive_timeout. We
> + * do this with the spinlock held to avoid race conditions -- for example
> + * the restart_lsn could move forward, or the slot could be dropped.
> + */
> + if (need_mutex)
> + SpinLockAcquire(&slot->mutex);
> ...
>
> I find this combination of parameters a bit strange. Because, say if
> need_mutex is false and need_control_lock is true then that means this
> function will acquire LWlock after acquiring spinlock which is
> unacceptable. Now, this may not happen in practice as the callers
> won't pass such a combination but still, this functionality should be
> improved.

Right. Either we need two locks or not. So, changed it to use just one
bool need_locks, upon set both control lock and spin lock are acquired
and released.

On Mon, Mar 25, 2024 at 10:33 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> patch 002:
>
> 2)
> slotsync.c:
>
> ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY,
> remote_slot->two_phase,
> remote_slot->failover,
> - true);
> + true, 0);
>
> + slot->data.inactive_timeout = remote_slot->inactive_timeout;
>
> Is there a reason we are not passing 'remote_slot->inactive_timeout'
> to ReplicationSlotCreate() directly?

The slot there gets created temporarily for which we were not
supporting inactive_timeout being set. But, in the latest v22 patch we
are supporting, so passing the remote_slot->inactive_timeout directly.

> 3)
> slotfuncs.c
> pg_create_logical_replication_slot():
> + int inactive_timeout = PG_GETARG_INT32(5);
>
> Can we mention here that timeout is in seconds either in comment or
> rename variable to inactive_timeout_secs?
>
> Please do this for create_physical_replication_slot(),
> create_logical_replication_slot(),
> pg_create_physical_replication_slot() as well.

Added /* in seconds */ next the variable declaration.

> ---------
> 4)
> + int inactive_timeout; /* The amount of time in seconds the slot
> + * is allowed to be inactive. */
> } LogicalSlotInfo;
>
> Do we need to mention "before getting invalided" like other places
> (in last patch)?

Done.

> 5)
> Same at these two places. "before getting invalided" to be added in
> the last patch otherwise the info is incompleted.
>
> +
> + /* The amount of time in seconds the slot is allowed to be inactive */
> + int inactive_timeout;
> } ReplicationSlotPersistentData;
>
>
> + * inactive_timeout: The amount of time in seconds the slot is allowed to be
> + * inactive.
> */
> void
> ReplicationSlotCreate(const char *name, bool db_specific,
> Same here. "before getting invalidated" ?

Done.

On Tue, Mar 26, 2024 at 12:04 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> > Please find the attached v21 patch implementing the above idea. It
> > also has changes for renaming last_inactive_time to inactive_since.
>
> Thanks for the patch. I have tested this patch alone, and it does what
> it says. One additional thing which I noticed is that now it sets
> inactive_since for temp slots as well, but that idea looks fine to me.

Right. Let's be consistent by treating all slots the same.

> I could not test 'invalidation on promotion bug' with this change, as
> that needed rebasing of the rest of the patches.

Please use the v22 patch set.

> Few trivial things:
>
> 1)
> Commti msg:
>
> ensures the value is set to current timestamp during the
> shutdown to help correctly interpret the time if the standby gets
> promoted without a restart.
>
> shutdown --> shutdown of slot sync worker (as it was not clear if it
> is instance shutdown or something else)

Changed it to "shutdown of slot sync machinery" to be consistent with
the comments.

> 2)
> 'The time since the slot has became inactive'.
>
> has became-->has become
> or just became
>
> Please check it in all the files. There are multiple places.

Fixed.

Please see the attached v23 patches. I've addressed all the review
comments received so far from Amit and Shveta.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v22-0001-Fix-review-comments-for-slot-s-last_inactive_tim.patch application/x-patch 21.2 KB
v22-0002-Allow-setting-inactive_timeout-for-replication-s.patch application/x-patch 34.7 KB
v22-0003-Add-inactive_timeout-based-replication-slot-inva.patch application/x-patch 26.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2024-03-26 09:04:20 Re: Catalog domain not-null constraints
Previous Message Alvaro Herrera 2024-03-26 08:41:05 Re: pgsql: Track last_inactive_time in pg_replication_slots.