Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-04-03 14:58:04
Message-ID: CALj2ACUBiRLHF7VYD-0nYVHSvVBSCQ_5OMbP4tE8Ys_rpvQ--A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Just one comment on v32-0001:
>
> +# Synced slot on the standby must get its own inactive_since.
> +is( $standby1->safe_psql(
> + 'postgres',
> + "SELECT '$inactive_since_on_primary'::timestamptz <= '$inactive_since_on_standby'::timestamptz AND
> + '$inactive_since_on_standby'::timestamptz <= '$slot_sync_time'::timestamptz;"
> + ),
> + "t",
> + 'synchronized slot has got its own inactive_since');
> +
>
> By using <= we are not testing that it must get its own inactive_since (as we
> allow them to be equal in the test). I think we should just add some usleep()
> where appropriate and deny equality during the tests on inactive_since.

Thanks. It looks like we can ignore the equality in all of the
inactive_since comparisons. IIUC, all the TAP tests do run with
primary and standbys on the single BF animals. And, it looks like
assigning the inactive_since timestamps to perl variables is giving
the microseconds precision level
(./tmp_check/log/regress_log_040_standby_failover_slots_sync:inactive_since
2024-04-03 14:30:09.691648+00). FWIW, we already have some TAP and SQL
tests relying on stats_reset timestamps without equality. So, I've
left the equality for the inactive_since tests.

> Except for the above, v32-0001 LGTM.

Thanks. Please see the attached v33-0001 patch after removing equality
on inactive_since TAP tests.

On Wed, Apr 3, 2024 at 1:47 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Some comments regarding v31-0002:
>
> === testing the behavior
>
> T1 ===
>
> > - synced slots don't get invalidated due to inactive timeout because
> > such slots not considered active at all as they don't perform logical
> > decoding (of course, they will perform in fast_forward mode to fix the
> > other data loss issue, but they don't generate changes for them to be
> > called as *active* slots)
>
> It behaves as described. OTOH non synced logical slots on the standby and
> physical slots on the standby are invalidated which is what is expected.

Right.

> T2 ===
>
> In case the slot is invalidated on the primary,
>
> primary:
>
> postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
> slot_name | inactive_since | invalidation_reason
> -----------+-------------------------------+---------------------
> s1 | 2024-04-03 06:56:28.075637+00 | inactive_timeout
>
> then on the standby we get:
>
> standby:
>
> postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
> slot_name | inactive_since | invalidation_reason
> -----------+------------------------------+---------------------
> s1 | 2024-04-03 07:06:43.37486+00 | inactive_timeout
>
> shouldn't the slot be dropped/recreated instead of updating inactive_since?

The sync slots that are invalidated on the primary aren't dropped and
recreated on the standby. There's no point in doing so because
invalidated slots on the primary can't be made useful. However, I
found that the synced slot is acquired and released unnecessarily
after the invalidation_reason is synced from the primary. I added a
skip check in synchronize_one_slot to skip acquiring and releasing the
slot if it's locally found inactive. With this, inactive_since won't
get updated for invalidated sync slots on the standby as we don't
acquire and release the slot.

> === code
>
> CR1 ===
>
> + Invalidates replication slots that are inactive for longer the
> + specified amount of time
>
> s/for longer the/for longer that/?

Fixed.

> CR2 ===
>
> + <literal>true</literal>) as such synced slots don't actually perform
> + logical decoding.
>
> We're switching in fast forward logical due to [1], so I'm not sure that's 100%
> accurate here. I'm not sure we need to specify a reason.

Fixed.

> CR3 ===
>
> + errdetail("This slot has been invalidated because it was inactive for more than the time specified by replication_slot_inactive_timeout parameter.")));
>
> I think we can remove "parameter" (see for example the error message in
> validate_remote_info()) and reduce it a bit, something like?
>
> "This slot has been invalidated because it was inactive for more than replication_slot_inactive_timeout"?

Done.

> CR4 ===
>
> + appendStringInfoString(&err_detail, _("The slot has been inactive for more than the time specified by replication_slot_inactive_timeout parameter."));
>
> Same.

Done. Changed it to "The slot has been inactive for more than
replication_slot_inactive_timeout."

> CR5 ===
>
> + /*
> + * This function isn't expected to be called for inactive timeout based
> + * invalidation. A separate function InvalidateInactiveReplicationSlot is
> + * to be used for that.
>
> Do you think it's worth to explain why?

Hm, I just wanted to point out the actual function here. I modified it
to something like the following, if others feel we don't need that, I
can remove it.

/*
* Use InvalidateInactiveReplicationSlot for inactive timeout based
* invalidation.
*/

> CR6 ===
>
> + if (replication_slot_inactive_timeout == 0)
> + return false;
> + else if (slot->inactive_since > 0)
>
> "else" is not needed here.

Nothing wrong there, but removed.

> CR7 ===
>
> + SpinLockAcquire(&slot->mutex);
> +
> + /*
> + * Check if the slot needs to be invalidated due to
> + * replication_slot_inactive_timeout GUC. We do this with the spinlock
> + * held to avoid race conditions -- for example the inactive_since
> + * could change, or the slot could be dropped.
> + */
> + now = GetCurrentTimestamp();
>
> We should not call GetCurrentTimestamp() while holding a spinlock.

I was thinking why to add up the wait time to acquire
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);. Now that I
moved it up before the spinlock but after the LWLockAcquire.

> CR8 ===
>
> +# Testcase start: Invalidate streaming standby's slot as well as logical
> +# failover slot on primary due to inactive timeout GUC. Also, check the logical
>
> s/inactive timeout GUC/replication_slot_inactive_timeout/?

Done.

> CR9 ===
>
> +# Start: Helper functions used for this test file
> +# End: Helper functions used for this test file
>
> I think that's the first TAP test with this comment. Not saying we should not but
> why did you feel the need to add those?

Hm. Removed.

> [1]: https://www.postgresql.org/message-id/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com

On Wed, Apr 3, 2024 at 2:58 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> v31-002:
> (I had reviewed v29-002 but missed to post comments, I think these
> are still applicable)
>
> 1) I think replication_slot_inactivity_timeout was recommended here
> (instead of replication_slot_inactive_timeout, so please give it a
> thought):
> https://www.postgresql.org/message-id/202403260739.udlp7lxixktx%40alvherre.pgsql

Yeah. It's synonymous with inactive_since. If others have an opinion
to have replication_slot_inactivity_timeout, I'm fine with it.

> 2) Commit msg:
> a)
> "It is often easy for developers to set a timeout of say 1
> or 2 or 3 days at slot level, after which the inactive slots get
> dropped."
>
> Shall we say invalidated rather than dropped?

Right. Done that.

> b)
> "To achieve the above, postgres introduces a GUC allowing users
> set inactive timeout and then a slot stays inactive for this much
> amount of time it invalidates the slot."
>
> Broken sentence.

Reworded it a bit.

Please find the attached v33 patches.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v33-0001-Allow-synced-slots-to-have-their-own-inactive_si.patch application/x-patch 15.1 KB
v33-0002-Add-inactive_timeout-based-replication-slot-inva.patch application/x-patch 32.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2024-04-03 15:05:44 Re: psql not responding to SIGINT upon db reconnection
Previous Message Tristan Partin 2024-04-03 14:55:08 Re: psql not responding to SIGINT upon db reconnection