From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | "suyu(dot)cmj" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, michael <michael(at)paquier(dot)xyz>, "bharath(dot)rupireddyforpostgres" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "amit(dot)kapila16" <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Question about InvalidatePossiblyObsoleteSlot() |
Date: | 2025-10-16 18:32:03 |
Message-ID: | CAD21AoCh+DxBVaRQ3Uz3OzSgtoNZ-ZAjyH+e_6OZkqg4g5P6hQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 16, 2025 at 1:23 AM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Wed, Oct 15, 2025 at 04:24:03PM -0700, Masahiko Sawada wrote:
> > On Tue, Oct 14, 2025 at 9:51 PM Bertrand Drouvot
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > >
> > > We don't really report an "invalidation", what we report is:
> > >
> > > LOG: terminating process 3998707 to release replication slot "logical_slot"
> > > DETAIL: The slot's restart_lsn 0/00842480 exceeds the limit by 2874240 bytes.
> > > HINT: You might need to increase "max_slot_wal_keep_size".
> > >
> > > and we terminate the process:
> > >
> > > FATAL: terminating connection due to administrator command
> > >
> > > We are not reporting:
> > >
> > > DETAIL: This replication slot has been invalidated due to "wal_removed".
> > >
> > > and the slot is still valid.
> > >
> > > That's the pre 818fefd8fd4 behavior.
> >
> > Thank you for the clarification! Understood.
> >
> > > Ideally, I think that we should not report anything and not terminate the
> > > process. I did not look at it, maybe we could look at it as a second step (first
> > > step being to restore the pre 818fefd8fd4 behavior)?
> >
> > I find that reporting the termination of a process holding a possibly
> > obsolete slot is fine, but reading some related threads[1][2] it seems
> > to me that a problem we want to avoid is that we report "terminated"
> > without leading to an "obsolete" message.
>
> Right, the focus at that time was around the invalidations related to the
> xmin conflicts (mainly because it felt unsafe to "ignore" the invalidation
> for them). Then later the restart_lsn was added to the discussion [1].
>
> But after more thought, I do think it's safe for the restart_lsn case
> (for the reasons mentioned above).
>
> > Does it make sense to report
> > explicitly that the slot's restart_lsn gets recovered and we therefore
> > skipped to invalidate it?
>
> I'm not sure. The existing logging shows when we invalidate a slot, so the
> absence of that message indicates we skipped it. Users can also verify the
> slot status in the view if needed.
>
I'm somewhat confused. As the commit message of 818fefd8fd4 says, the
invalidation of an active slot is done in two steps:
- Termination of the backend holding it, if any.
- Report that the slot is obsolete, with a conflict cause depending on
the slot's data.

Before commit 818fefd8fd4, if the slot's effective_xmin,
catalog_effective_xmin, or restart_lsn advanced between the two steps,
we could end up skipping the actual invalidation of the slot. The
invalidation cause used when terminating the process could not change
between the two steps (say, from RS_INVAL_HORIZON to
RS_INVAL_WAL_REMOVED), since possible_cause doesn't change within
InvalidatePossiblyObsoleteSlot(). Commit 818fefd8fd4 introduced the
initial_xxx variables to ensure that when we report "terminated" for a
given reason, we also invalidate the slot for that same reason.
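
To make the difference concrete, here is a toy model of the second step,
not PostgreSQL code. All names (ToySlot, toy_cause, toy_second_pass) are
made up for illustration, and the check "restart_lsn < oldest_lsn" is my
assumption about how the WAL-removed condition is evaluated. The
use_initial flag switches between re-reading the live value (roughly the
pre-818fefd8fd4 behavior) and reusing the value captured in the first
step (roughly what the initial_xxx variables do).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; none of these are PostgreSQL's real definitions. */
typedef enum { TOY_INVAL_NONE, TOY_INVAL_WAL_REMOVED } ToyCause;

typedef struct ToySlot
{
    uint64_t restart_lsn;   /* live value, may advance between the steps */
} ToySlot;

/* Assumed rule: a valid restart_lsn older than oldest_lsn is a conflict. */
static ToyCause
toy_cause(uint64_t restart_lsn, uint64_t oldest_lsn)
{
    return (restart_lsn != 0 && restart_lsn < oldest_lsn)
        ? TOY_INVAL_WAL_REMOVED : TOY_INVAL_NONE;
}

/*
 * Second step, run after the owning backend has been terminated.
 * use_initial = false: evaluate the live value, so an advanced slot
 *                      escapes invalidation ("terminated" only).
 * use_initial = true:  evaluate the value captured in the first step,
 *                      so the cause matches what was reported.
 */
static ToyCause
toy_second_pass(const ToySlot *slot, uint64_t initial_restart_lsn,
                uint64_t oldest_lsn, bool use_initial)
{
    uint64_t lsn = use_initial ? initial_restart_lsn : slot->restart_lsn;

    return toy_cause(lsn, oldest_lsn);
}
```

With oldest_lsn = 1000, a slot captured at 900 that has advanced to 1200
by the second step yields TOY_INVAL_NONE when use_initial is false and
TOY_INVAL_WAL_REMOVED when it is true.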

What the patch proposed on this thread does is allow that
pre-818fefd8fd4 behavior for restart_lsn only. That is, we skip the
actual slot invalidation if the slot's restart_lsn advances and exceeds
oldestLSN between the two steps.
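
As I read it, the resulting asymmetry looks roughly like the sketch
below. This is only my paraphrase of the description above, not the
patch itself: SketchCause, SketchSlot and sketch_still_invalidate are
hypothetical names, and treating restart_lsn == oldest_lsn as "caught
up" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not PostgreSQL's definitions. */
typedef enum { SKETCH_HORIZON, SKETCH_WAL_REMOVED } SketchCause;

typedef struct SketchSlot
{
    uint64_t restart_lsn;      /* live value at the second step */
    uint32_t effective_xmin;   /* live value, deliberately not consulted */
} SketchSlot;

/*
 * Second step: decide whether to go through with the invalidation that
 * justified terminating the backend in the first step.
 */
static bool
sketch_still_invalidate(SketchCause first_step_cause,
                        const SketchSlot *slot, uint64_t oldest_lsn)
{
    switch (first_step_cause)
    {
        case SKETCH_WAL_REMOVED:
            /* Re-check the live restart_lsn; skip if it has caught up. */
            return slot->restart_lsn < oldest_lsn;
        case SKETCH_HORIZON:
            /* Keep the first-step decision, whatever xmin is by now. */
            return true;
    }
    return true;                /* unreachable for valid causes */
}
```

So only the WAL-removed cause can end with "terminated" but no
invalidation, while a horizon conflict is always carried through.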

I thought the reason we moved away from the pre-818fefd8fd4 behavior
was that we wanted to avoid reporting only "terminated" and to report
both things consistently. But if we can accept the pre-818fefd8fd4
behavior for restart_lsn, why can't we do the same for effective_xmin
and catalog_effective_xmin? I guess the same reasoning applies to them:
we can skip the actual slot invalidation if those values advance and
exceed snapshotConflictHorizon.
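
For the xmin side, an equivalent re-check at the second step might
compare the live xmin against the horizon as in the sketch below. It is
a simplified illustration, not the actual code: DemoXid and both
functions are hypothetical names, the comparison only handles normal
XIDs (special XIDs are ignored), and 0 is assumed to mean "invalid".

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t DemoXid;   /* hypothetical stand-in for TransactionId */

/*
 * Wraparound-aware "a <= b" for normal XIDs only: the signed difference
 * decides the order, so the result stays correct across the 2^32 wrap.
 */
static bool
demo_xid_precedes_or_equals(DemoXid a, DemoXid b)
{
    return (int32_t) (a - b) <= 0;
}

/*
 * True if a slot whose live xmin is `xmin` still conflicts with the
 * horizon; false if xmin is invalid (0) or has advanced beyond it.
 */
static bool
demo_xmin_conflicts(DemoXid xmin, DemoXid horizon)
{
    return xmin != 0 && demo_xid_precedes_or_equals(xmin, horizon);
}
```

With a horizon of 750, an xmin of 700 still conflicts, while an xmin
that has advanced to 800 does not, so the invalidation could be skipped
in the same way as for restart_lsn.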

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com