Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-02-21 12:25:25
Message-ID: ZdXrtXLkjvIJMYvB@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Feb 21, 2024 at 10:55:00AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 20, 2024 at 12:05 PM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> >
> > > [...] and was able to produce something like:
> > >
> > > postgres=# select slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from pg_replication_slots;
> > >   slot_name  | slot_type | active | active_pid | wal_status | invalidation_reason
> > > -------------+-----------+--------+------------+------------+---------------------
> > >  rep1        | physical  | f      |            | reserved   |
> > >  master_slot | physical  | t      |    1482441 | unreserved | wal_removed
> > > (2 rows)
> > >
> > > does that make sense to have an "active/working" slot "invalidated"?
> >
> > Thanks. Can you please provide the steps to generate this error? Are
> > you setting max_slot_wal_keep_size on primary to generate
> > "wal_removed"?
>
> I'm able to reproduce [1] the state [2] where the slot got invalidated
> first, then its wal_status became unreserved, but still the slot is
> serving after the standby comes up online after it catches up with the
> primary getting the WAL files from the archive. There's a good reason
> for this state -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/slotfuncs.c;h=d2fa5e669a32f19989b0d987d3c7329851a1272e;hb=ff9e1e764fcce9a34467d614611a34d4d2a91b50#l351.
> This intermittent state can only happen for physical slots, not for
> logical slots because logical subscribers can't get the missing
> changes from the WAL stored in the archive.
>
> And, the fact looks to be that an invalidated slot can never become
> normal but still can serve a standby if the standby is able to catch
> up by fetching required WAL (this is the WAL the slot couldn't keep
> for the standby) from elsewhere (archive via restore_command).
>
> As far as the 0001 patch is concerned, it reports the
> invalidation_reason as long as slot_contents.data.invalidated !=
> RS_INVAL_NONE. I think this is okay.
>
> Thoughts?

Yeah, looking at the code I agree that it looks OK. OTOH, the resulting state
(an invalidated slot that is still active and serving a standby) is confusing,
so maybe we should add a few words about it in the doc?

Looking at v5-0001:

+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>invalidation_reason</structfield> <type>text</type>
+ </para>
+ <para>

My initial thought was to put a "conflict" value in this new field in case of
a conflict (without repeating the conflict reason in it). With the current
proposal, invalidation_reason could report the same value as conflict_reason,
which sounds weird to me.

Would it make sense to you to use "conflict" as the value of
"invalidation_reason" when the slot's "conflict_reason" is not NULL?
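
To make the question concrete, here is a rough standalone sketch (not patch
code). Everything prefixed with sketch_ or SK_ is hypothetical and local to
this example. It assumes that conflict_reason is only set for invalidated
logical slots, that the reason strings are "wal_removed", "rows_removed" and
"wal_level_insufficient", and that v5-0001 reuses those same strings for
invalidation_reason:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the slot invalidation causes; local to this sketch. */
typedef enum SketchInvalCause
{
    SK_INVAL_NONE = 0,
    SK_INVAL_WAL_REMOVED,
    SK_INVAL_HORIZON,
    SK_INVAL_WAL_LEVEL
} SketchInvalCause;

/* Hypothetical helper: text for a cause; NULL stands for SQL NULL. */
static const char *
sketch_cause_text(SketchInvalCause cause)
{
    switch (cause)
    {
        case SK_INVAL_NONE:
            return NULL;
        case SK_INVAL_WAL_REMOVED:
            return "wal_removed";
        case SK_INVAL_HORIZON:
            return "rows_removed";
        case SK_INVAL_WAL_LEVEL:
            return "wal_level_insufficient";
    }
    return NULL;
}

/* Assumption: conflict_reason is reported for logical slots only. */
static const char *
sketch_conflict_reason(bool is_logical, SketchInvalCause cause)
{
    if (!is_logical)
        return NULL;
    return sketch_cause_text(cause);
}

/* v5-0001 as described upthread: report the cause whenever it is not NONE. */
static const char *
sketch_invalidation_reason_v5(SketchInvalCause cause)
{
    return sketch_cause_text(cause);
}

/* The alternative asked about: collapse to "conflict" when conflict_reason is set. */
static const char *
sketch_invalidation_reason_alt(bool is_logical, SketchInvalCause cause)
{
    if (sketch_conflict_reason(is_logical, cause) != NULL)
        return "conflict";
    return sketch_cause_text(cause);
}

/* True when the v5 behaviour shows the same text in both columns. */
static bool
sketch_reasons_duplicate(bool is_logical, SketchInvalCause cause)
{
    const char *conflict = sketch_conflict_reason(is_logical, cause);
    const char *inval = sketch_invalidation_reason_v5(cause);

    return conflict != NULL && inval != NULL && strcmp(conflict, inval) == 0;
}
```

Under those assumptions, a physical slot like master_slot above reports
wal_removed with both variants, while an invalidated logical slot would show
the same string in both columns with v5 and "conflict" with the alternative.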

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
