Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-03-16 03:59:01
Message-ID: CALj2ACWBbSa7ccOGDHNfH1OX7_ArBJm1kNmc=iWKSfqq1NHnuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 15, 2024 at 12:49 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> patch002:
>
> 1)
> I would like to understand the purpose of 'inactive_count'? Is it only
> for users for monitoring purposes? We are not using it anywhere
> internally.

inactive_count metric helps detect unstable replication slots
connections that have a lot of disconnections. It's not used for the
inactive_timeout based slot invalidation mechanism.

> I shutdown the instance 5 times and found that 'inactive_count' became
> 5 for all the slots created on that instance. Is this intentional?

Yes, it's incremented on shutdown (and for that matter upon every slot
release) for all the slots that are tied to walsenders.

> I mean we can not really use them if the instance is down. I felt it
> should increment the inactive_count only if during the span of
> instance, they were actually inactive i.e. no streaming or replication
> happening through them.

inactive_count is persisted to disk- upon clean shutdown, so, once the
slots become active again, one gets to see the metric and deduce some
info on disconnections.

Having said that, I'm okay to hear from others on the inactive_count
metric being added.

> 2)
> slot.c:
> + case RS_INVAL_XID_AGE:
>
> Can we optimize this code? It has duplicate code for processing
> s->data.catalog_xmin and s->data.xmin. Can we create a sub-function
> for this purpose and call it twice here?

Good idea. Done that way.

> 2)
> The msg for patch 3 says:
> --------------
> a) when replication slots is lying inactive for a day or so using
> last_inactive_at metric,
> b) when a replication slot is becoming inactive too frequently using
> last_inactive_at metric.
> --------------
> I think in b, you want to refer to inactive_count instead of last_inactive_at?

Right. Changed.

> 3)
> I do not see invalidation_reason updated for 2 new reasons in system-views.sgml

Nice catch. Added them now.

I've also responded to Bertrand's comments here.

On Wed, Mar 6, 2024 at 3:56 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> A few comments:
>
> 1 ===
>
> + The reason for the slot's invalidation. <literal>NULL</literal> if the
> + slot is currently actively being used.
>
> s/currently actively being used/not invalidated/ ? (I mean it could be valid
> and not being used).

Changed.

> 3 ===
>
> res = executeQueryOrDie(conn, "SELECT slot_name, plugin, two_phase, failover, "
> - "%s as caught_up, conflict_reason IS NOT NULL as invalid "
> + "%s as caught_up, invalidation_reason IS NOT NULL as invalid "
> "FROM pg_catalog.pg_replication_slots "
> - "(CASE WHEN conflict_reason IS NOT NULL THEN FALSE "
> + "(CASE WHEN invalidation_reason IS NOT NULL THEN FALSE "
>
> Yeah that's fine because there is logical slot filtering here.

Right. And, we really are looking for invalid slots there, so use of
invalidation_reason is much more correct than conflicting.

> 4 ===
>
> -GetSlotInvalidationCause(const char *conflict_reason)
> +GetSlotInvalidationCause(const char *invalidation_reason)
>
> Should we change the comment "Maps a conflict reason" above this function?

Changed.

> 5 ===
>
> -# Check conflict_reason is NULL for physical slot
> +# Check invalidation_reason is NULL for physical slot
> $res = $node_primary->safe_psql(
> 'postgres', qq[
> - SELECT conflict_reason is null FROM pg_replication_slots where slot_name = '$primary_slotname';]
> + SELECT invalidation_reason is null FROM pg_replication_slots where slot_name = '$primary_slotname';]
> );
>
>
> I don't think this test is needed anymore: it does not make that much sense since
> it's done after the primary database initialization and startup.

It is now turned into a test verifying 'conflicting boolean' is null
for the physical slot. Isn't that okay?

> 6 ===
>
> 'Logical slots are reported as non conflicting');
>
> What about?
>
> "
> # Verify slots are reported as valid in pg_replication_slots
> 'Logical slots are reported as valid');
> "

Changed.

Please see the attached v11 patch set with all the above review
comments addressed.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v11-0001-Track-invalidation_reason-in-pg_replication_slot.patch application/octet-stream 20.2 KB
v11-0002-Add-XID-age-based-replication-slot-invalidation.patch application/octet-stream 14.0 KB
v11-0003-Track-inactive-replication-slot-information.patch application/octet-stream 10.0 KB
v11-0004-Add-inactive_timeout-based-replication-slot-inva.patch application/octet-stream 12.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2024-03-16 04:25:53 Add TAP tests for backtrace functionality (was Re: Add test module for verifying backtrace functionality)
Previous Message Fujii.Yuki@df.MitsubishiElectric.co.jp 2024-03-16 02:28:50 RE: Partial aggregates pushdown