Quick Links

Re: Introduce XID age and inactive timeout based replication slot invalidation

From:	Peter Smith <smithpb2250(at)gmail(dot)com>
To:	Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc:	shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Introduce XID age and inactive timeout based replication slot invalidation
Date:	2024-11-13 23:58:45
Message-ID:	CAHut+Pt6s-qNPdxH5=-fr2QKLEv0h16sQ8EvLiGJ-SdQNS6pbw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Nisha.

Thanks for the recent patch updates. Here are my review comments for
the latest patch v48-0001.

======
Commit message

1.
Till now, postgres has the ability to invalidate inactive
replication slots based on the amount of WAL (set via
max_slot_wal_keep_size GUC) that will be needed for the slots in
case they become active. However, choosing a default value for
this GUC is a bit tricky. Because the amount of WAL a database
generates, and the allocated storage for instance will vary
greatly in production, making it difficult to pin down a
one-size-fits-all value.

What do the words "for instance" mean here? Did it mean "per instance"
or "(for example)" or something else?

======
doc/src/sgml/system-views.sgml

2.
<para>
The time since the slot has become inactive.
- <literal>NULL</literal> if the slot is currently being used.
- Note that for slots on the standby that are being synced from a
+ <literal>NULL</literal> if the slot is currently being used. Once the
+ slot is invalidated, this value will remain unchanged until we shutdown
+ the server. Note that for slots on the standby that are being
synced from a
primary server (whose <structfield>synced</structfield> field is
<literal>true</literal>), the

Is this change related to the new inactivity timeout feature or are
you just clarifying the existing behaviour of the 'active_since'
field.

Note there is already another thread [1] created to patch/clarify this
same field. So if you are just clarifying existing behavior then IMO
it would be better if you can to try and get your desired changes
included there quickly before that other patch gets pushed.

~~~

3.
+ <para>
+ <literal>inactive_timeout</literal> means that the slot has been
+ inactive for longer than the amount of time specified by the
+ <xref linkend="guc-replication-slot-inactive-timeout"/> parameter.
+ </para>

Maybe there is a slightly shorter/simpler way to express this. For example,

BEFORE
inactive_timeout means that the slot has been inactive for longer than
the amount of time specified by the replication_slot_inactive_timeout
parameter.

SUGGESTION
inactive_timeout means that the slot has remained inactive beyond the
duration specified by the replication_slot_inactive_timeout parameter.

======
src/backend/replication/slot.c

4.
+int replication_slot_inactive_timeout = 0;

IMO it would be more informative to give the units in the variable
name (but not in the GUC name). e.g.
'replication_slot_inactive_timeout_secs'.

~~~

ReplicationSlotAcquire:

5.
+ *
+ * An error is raised if error_if_invalid is true and the slot has been
+ * invalidated previously.
*/
void
-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid)

This function comment makes it seem like "invalidated previously"
might mean *any* kind of invalidation, but later in the body of the
function we find the logic is really only used for inactive timeout.

+ /*
+ * An error is raised if error_if_invalid is true and the slot has been
+ * previously invalidated due to inactive timeout.
+ */

So, I think a better name for that parameter might be
'error_if_inactive_timeout'

OTOH, if it really is supposed to erro for *any* kind of invalidation
then there needs to be more ereports.

~~~

6.
+ errdetail("This slot has been invalidated because it was inactive
for longer than the amount of time specified by \"%s\".",

This errdetail message seems quite long. I think it can be shortened
like below and still retain exactly the same meaning:

BEFORE:
This slot has been invalidated because it was inactive for longer than
the amount of time specified by \"%s\".

SUGGESTION:
This slot has been invalidated due to inactivity exceeding the time
limit set by "%s".

~~~

ReportSlotInvalidation:

7.
+ case RS_INVAL_INACTIVE_TIMEOUT:
+ Assert(inactive_since > 0);
+ appendStringInfo(&err_detail,
+ _("The slot has been inactive since %s for longer than the amount of
time specified by \"%s\"."),
+ timestamptz_to_str(inactive_since),
+ "replication_slot_inactive_timeout");
+ break;

Here also as in the above review comment #6 I think the message can be
shorter and still say the same thing

BEFORE:
_("The slot has been inactive since %s for longer than the amount of
time specified by \"%s\"."),

SUGGESTION:
_("The slot has been inactive since %s, exceeding the time limit set
by \"%s\"."),

~~~

SlotInactiveTimeoutCheckAllowed:

8.
+/*
+ * Is this replication slot allowed for inactive timeout invalidation check?
+ *
+ * Inactive timeout invalidation is allowed only when:
+ *
+ * 1. Inactive timeout is set
+ * 2. Slot is inactive
+ * 3. Server is in recovery and slot is not being synced from the primary
+ *
+ * Note that the inactive timeout invalidation mechanism is not
+ * applicable for slots on the standby server that are being synced
+ * from the primary server (i.e., standby slots having 'synced' field 'true').
+ * Synced slots are always considered to be inactive because they don't
+ * perform logical decoding to produce changes.
+ */

8a.
Somehow that first sentence seems strange. Would it be better to write it like:

SUGGESTION
Can this replication slot timeout due to inactivity?

8b.
AFAICT that reason 3 ("Server is in recovery and slot is not being
synced from the primary") seems not quite worded right...

Should it say more like:
The slot is not being synced from the primary while the server is in recovery

or maybe like:
The slot is not currently being synced from the primary (e.g. not
'synced' is true when server is in recovery)

8c.
Similarly, I think something about that "Note that the inactive
timeout invalidation mechanism is not applicable..." paragraph needs
tweaking because IMO that should also now be saying something about
'RecoveryInProgress'.

~~~

9.
+static inline bool
+SlotInactiveTimeoutCheckAllowed(ReplicationSlot *s)

Maybe the function name should be 'IsSlotInactiveTimeoutPossible' or
something better.

~~~

InvalidatePossiblyObsoleteSlot:

10.
break;
+ case RS_INVAL_INACTIVE_TIMEOUT:
+
+ /*
+ * Check if the slot needs to be invalidated due to
+ * replication_slot_inactive_timeout GUC.
+ */

Since there are no other blank lines anywhere in this switch, the
introduction of this one in v48 looks out of place to me. IMO it would
be more readable if a blank line followed each/every of the breaks,
but then that is not a necessary change for this patch so...

~~~

11.
+ /*
+ * Invalidation due to inactive timeout implies that
+ * no one is using the slot.
+ */
+ Assert(s->active_pid == 0);

Given this assertion, does it mean that "(s->active_pid == 0)" should
have been another condition done up-front in the function
'SlotInactiveTimeoutCheckAllowed'?

~~~

12.
/*
- * If the slot can be acquired, do so and mark it invalidated
- * immediately. Otherwise we'll signal the owning process, below, and
- * retry.
+ * If the slot can be acquired, do so and mark it as invalidated. If
+ * the slot is already ours, mark it as invalidated. Otherwise, we'll
+ * signal the owning process below and retry.
*/
- if (active_pid == 0)
+ if (active_pid == 0 ||
+ (MyReplicationSlot == s &&
+ active_pid == MyProcPid))

I wasn't sure how this change belongs to this patch, because the logic
of the previous review comment said for the case of invalidation due
to inactivity that active_id must be 0. e.g. Assert(s->active_pid ==
0);

~~~

RestoreSlotFromDisk:

13.
- slot->inactive_since = GetCurrentTimestamp();
+ slot->inactive_since = now;

In v47 this assignment used to call the function
'ReplicationSlotSetInactiveSince'. I recognise there is a very subtle
difference between direct assignment and the function, because the
function will skip assignment if the slot is already invalidated.
Anyway, if you are *deliberately* not wanting to call
ReplicationSlotSetInactiveSince here then I think this assignment
should be commented to explain the reason why not, otherwise someone
in the future might be tempted to think it was just an oversight and
add the call back in that you don't want.

======
src/test/recovery/t/050_invalidate_slots.pl

14.
+# Despite inactive timeout being set, the synced slot won't get invalidated on
+# its own on the standby. So, we must not see invalidation message in server
+# log.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+is( $standby1->safe_psql(
+ 'postgres',
+ q{SELECT count(*) = 1 FROM pg_replication_slots
+ WHERE slot_name = 'sync_slot1'
+ AND invalidation_reason IS NULL;}
+ ),
+ "t",
+ 'check that synced slot sync_slot1 has not been invalidated on standby');
+

But, now, we are confirming this by another way -- not checking the
logs here, so the comment "So, we must not see invalidation message in
server log." is no longer appropriate here.

======
[1] https://www.postgresql.org/message-id/flat/CAA4eK1JQFdssaBBh-oQskpKM-UpG8jPyUdtmGWa_0qCDy%2BK7_A%40mail.gmail.com#ab98379f220288ed40d34f8c2a21cf96

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Re: Introduce XID age and inactive timeout based replication slot invalidation at 2024-11-13 09:30:28 from Nisha Moond

Responses

Re: Introduce XID age and inactive timeout based replication slot invalidation at 2024-11-19 07:17:37 from Nisha Moond

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2024-11-14 00:27:06	Re: define pg_structiszero(addr, s, r)
Previous Message	Tomas Vondra	2024-11-13 23:47:22	Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4