Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-05-28 08:29:10
Message-ID: CAM_vCuc297m-ZroQM_yT561T6_uFYHLdE=b7+5PA6QQqjB-8UQ@mail.gmail.com
Lists: pgsql-bugs

Hi,

I didn't have time to check fix_hang_15.patch until today.

On Thu, May 23, 2024 at 12:57 AM Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:

> On Mon, May 20, 2024 at 4:48 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> >
> > On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> > > On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> > > > this? We'll need to get confident about 1ccc1e05ae before v17, and that
> > > > sounds potentially easier than getting confident about both 1ccc1e05ae and
> > > > this other approach.
> > >
> > > I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
> > > why we would want to use stable back branches to get comfortable with
> > > an approach committed to an unreleased version of Postgres.
> >
> > I wouldn't say we want to use stable back branches to get comfortable with an
> > approach. I wanted to say that it's less work to be confident about one new
> > way of doing things than two new ways of doing things.
>
> I think I understand your point better now.
>
> > > The small fix proposed in this thread is extremely minimal and
> > > straightforward. It seems much less risky as a backpatch.
> >
> > Here's how I model the current and proposed code:
> >
> > 1. v15 VACUUM removes tuples that are HEAPTUPLE_DEAD according to VisTest.
> > OldestXmin doesn't cause tuple removal, but there's a hang when OldestXmin
> > rules DEAD after VisTest ruled RECENTLY_DEAD.
> >
> > 2. With 1ccc1e05ae, v17 VACUUM still removes tuples that are HEAPTUPLE_DEAD
> > according to VisTest only. OldestXmin doesn't come into play.
> >
> > 3. fix_hang_15.patch would make v15 VACUUM remove tuples that are
> > HEAPTUPLE_DEAD according to _either_ VisTest or OldestXmin.
> >
> > Since (3) is the only list entry that removes tuples that VisTest ruled
> > RECENTLY_DEAD, I find it higher risk. For all three, the core task of
> > certifying the behavior is confirming that its outcome is sound when VisTest
> > and OldestXmin disagree. How do you model it?
>
> Okay, I see your point. In 1 and 2, tuples that would have been
> considered HEAPTUPLE_DEAD at the beginning of vacuum but are
> considered HEAPTUPLE_RECENTLY_DEAD when pruning them are not removed.
> In 3, tuples that would have been considered HEAPTUPLE_DEAD at the
> beginning of vacuum are always removed, regardless of whether or not
> they would be considered HEAPTUPLE_RECENTLY_DEAD when pruning them.
>
> One option is to add the logic in fix_hang_15.patch to master as well
> (always remove tuples older than OldestXmin). This addresses your
> concern about gaining confidence in a single solution.
>
> However, I can see how removing more tuples could be concerning. In
> the case that the horizon moves backwards because of a standby
> reconnecting, I think the worst case is that removing that tuple
> causes a recovery conflict on the standby (depending on the value of
> max_standby_streaming_delay et al).
>
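
For reference, the three cases modeled in the quoted text can be written
down as a toy sketch. This is not PostgreSQL code. It assumes a deleted
tuple counts as DEAD once its deleting XID precedes the horizon being
consulted, it ignores XID wraparound, and every name in it (is_dead,
prune_removes, v15_hangs, the policy strings) is made up for illustration.

```python
def is_dead(deleter_xid, horizon):
    """Toy rule: DEAD if the deleting XID precedes the horizon,
    otherwise RECENTLY_DEAD. Ignores XID wraparound."""
    return deleter_xid < horizon


def prune_removes(policy, deleter_xid, vistest_horizon, oldest_xmin):
    """Would VACUUM remove the tuple under the given (hypothetical) policy name?"""
    by_vistest = is_dead(deleter_xid, vistest_horizon)
    by_oldest_xmin = is_dead(deleter_xid, oldest_xmin)
    if policy in ("v15", "v17"):
        # Cases 1 and 2: only VisTest decides removal.
        return by_vistest
    if policy == "fix_hang_15":
        # Case 3: either horizon is enough to remove the tuple.
        return by_vistest or by_oldest_xmin
    raise ValueError(f"unknown policy: {policy}")


def v15_hangs(deleter_xid, vistest_horizon, oldest_xmin):
    """Case 1 hangs when OldestXmin rules DEAD after VisTest ruled RECENTLY_DEAD."""
    return (is_dead(deleter_xid, oldest_xmin)
            and not is_dead(deleter_xid, vistest_horizon))


# Example with made-up XIDs: the VisTest horizon (90) is behind OldestXmin (100),
# and the tuple was deleted by XID 95, so the two horizons disagree.
for policy in ("v15", "v17", "fix_hang_15"):
    print(policy, prune_removes(policy, 95, 90, 100))
print("v15 hangs:", v15_hangs(95, 90, 100))
```

Only the third policy removes the tuple in the disagreeing case, which is
the extra removal being weighed in the quoted discussion.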

What would happen if we simply skipped the current page once we detect that
the vacuum process has entered the infinite loop (for example, by counting
retries)?
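
A minimal sketch of that idea, as a toy model rather than actual vacuum
code. It assumes pruning removes only tuples whose deleting XID precedes
the VisTest horizon, and that the page is retried while a surviving tuple
is still DEAD according to OldestXmin. The function name scan_page, the
max_retries bound, and the return values are all hypothetical.

```python
def scan_page(deleter_xids, vistest_horizon, oldest_xmin, max_retries=3):
    """Toy retry loop with a bounded counter.

    Returns ("processed", retries) if no surviving tuple is DEAD per
    OldestXmin, or ("skipped", retries) once the retry bound is reached.
    """
    retries = 0
    while True:
        # Pruning in this model keeps every tuple VisTest does not rule DEAD.
        survivors = [xid for xid in deleter_xids if not xid < vistest_horizon]
        # The hang condition: a survivor that OldestXmin still rules DEAD.
        if all(xid >= oldest_xmin for xid in survivors):
            return ("processed", retries)
        if retries == max_retries:
            # Give up on this page instead of looping forever.
            return ("skipped", retries)
        retries += 1


# Made-up XIDs: the first page contains the disagreeing tuple, the second does not.
print(scan_page([95], vistest_horizon=90, oldest_xmin=100))
print(scan_page([80], vistest_horizon=90, oldest_xmin=100))
```

In this model the skipped page still holds a tuple older than OldestXmin,
so whether skipping is safe depends on how relfrozenxid is then advanced,
which the sketch does not cover.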

--
Regards
Bowen Shi
