Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Noah Misch <noah(at)leadboat(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-03-22 19:43:23
Message-ID: 20240322194323.8a.nmisch@google.com
Lists: pgsql-bugs

On Fri, Mar 22, 2024 at 02:41:25PM -0400, Melanie Plageman wrote:
> On Fri, Mar 22, 2024 at 8:22 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Thu, Mar 21, 2024 at 1:22 PM Matthias van de Meent
> > <boekewurm+postgres(at)gmail(dot)com> wrote:
> > > > So it seems like Matthias, Peter, and Andres all agree that
> > > > GlobalVisState->maybe_needed going backward is bad and causes this
> > > > problem. Unfortunately, I don't understand the mechanism.
> > >
> > > There are 2 mechanisms I know of which allow this value to go backwards:
> >
> > I actually wasn't asking about the mechanism by which
> > GlobalVisState->maybe_needed could go backwards. I was asking about
> > the mechanism by which that could cause bad things to happen.
> >
> > > 1. Replication slots that connect may set their backend's xmin to an
> > > xmin < GlobalXmin.
> > > This is known and has been documented, and was considered OK when this
> > > was discussed on the list previously.
> >
> > Right, OK.
> >
> > > 2. The commit abort path has a short window in which the backend's
> > > xmin is unset and does not mirror the xmin of registered snapshots.
> > > This is what I described in [0], and may be the worst (?) offender.
> > >
> > > [0] https://www.postgresql.org/message-id/CAEze2Wj%2BV0kTx86xB_YbyaqTr5hnE_igdWAwuhSyjXBYscf5-Q%40mail.gmail.com
> >
> > So, what I would say is that this sounds inadvertent and so perhaps we
> > should do something about it, but also, it seems wrong to me that it
> > causes any serious problem. As far as I know, we've always treated the
> > result of an xmin calculation going backward as a rare but expected
> > case with which everything that depends on xmin calculations must
> > cope.
>
> I'm still catching up here, so forgive me if this is a dumb question:
> Does using GlobalVisState instead of VacuumCutoffs->OldestXmin when
> freezing and determining relfrozenxid not solve the problem?

One could fix it along those lines. If GlobalVisState moves forward during
VACUUM, that's fine, but relfrozenxid needs to reflect the overall outcome,
not just the final GlobalVisState. Suppose we remove XIDs <100 at page 1, <99
at page 2, and <101 at page 3. relfrozenxid needs the value it would get if
we had removed <99 at every page. I think GlobalVisState doesn't track that
today, but it could. The 2024-03-14 commit e85662d added
GetStrictOldestNonRemovableTransactionId(), which targets a similar problem.
I've not reviewed it, but I suggest checking it for relevance to $SUBJECT.
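
To make the page 1/2/3 arithmetic concrete, here is a minimal sketch, not PostgreSQL code. It assumes plain integer XIDs with no wraparound, and the helper name safe_relfrozenxid is hypothetical. It shows only that the safe value is the minimum cutoff used on any page, not the last cutoff observed.

```python
from typing import Iterable


def safe_relfrozenxid(per_page_cutoffs: Iterable[int]) -> int:
    """Return the oldest removal cutoff applied to any page.

    Hypothetical helper for illustration only. XIDs are treated as plain
    integers, so wraparound-aware comparison is deliberately ignored.
    """
    cutoffs = list(per_page_cutoffs)
    if not cutoffs:
        raise ValueError("no pages were processed")
    # A page pruned with cutoff C may still hold XIDs >= C, so the
    # table-wide value cannot exceed the smallest cutoff seen.
    return min(cutoffs)


# Cutoffs from the example: removed <100 at page 1, <99 at page 2,
# <101 at page 3.
cutoffs = [100, 99, 101]
final_only = cutoffs[-1]              # what "just the final state" would give
overall = safe_relfrozenxid(cutoffs)  # what relfrozenxid actually needs
```

With these inputs, final_only is 101 while overall is 99. Page 2 may still contain XIDs 99 and 100, so advancing relfrozenxid to 101 would leave it ahead of XIDs still present in the table.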
