Re: HOT chain validation in verify_heapam()

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Himanshu Upadhyaya <upadhyaya(dot)himanshu(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HOT chain validation in verify_heapam()
Date: 2022-11-10 01:32:46
Message-ID: CAH2-Wz=R5utC8ET8848Adexj-_gJi-2XOrbCNRZ0G-m3AS_GPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 9, 2022 at 4:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> To me this is extending the problem into more areas rather than reducing
> it. I'd have *zero* confidence in any warnings that amcheck issued that
> involved <9.4 special cases.

Maybe you would at first. But then we get to learn what mistake we
made. And then we get to fix the bug, and get to know better next time
around. Next time (or maybe the time after that) you really will have
confidence in amcheck, because it'll have been battle tested at that
point.

For something like this that seems like the best way to go. Either we
support an on-disk format that includes legacy representations, or we
don't.

> I think it doesn't just affect the < 9.4 path, but also makes us implement
> things differently for >= 9.4. And we loose some accuracy due to that.

I don't follow. How so?

> The field we check for FrozenTransactionId in the code I was quoting is the
> xmin of the follower tuple. We follow the chain if either
> cur->xmax == next->xmin or if next->xmin == FrozenTransactionId
>
> What I'm doubting is the FrozenTransactionId path.

AFAICT we shouldn't be treating it as part of the same HOT chain. Only
the first heap-only tuple in a valid HOT chain should have an xmin
that is FrozenTransactionId (and only with an LP_REDIRECT root item, I
think). Otherwise the "prev_xmax==xmin" HOT chain traversal logic used
in places like heap_hot_search_buffer() simply won't work.

> > In my view it simply isn't possible for a valid HOT chain to be in
> > this state in the first place. So by definition it wouldn't be a HOT
> > chain.
>
> We haven't done any visibility checking at this point and my whole point is
> that there's no guarantee that the pointed-to tuple actually belongs to the
> same hot chain, given that we follow as soon as "xmin == FrozenXid". So the
> pointing tuple might be an orphaned tuple.

But an orphaned heap-only tuple shouldn't ever have
xmin==FrozenTransactionId to begin with. The only valid source of
orphaned heap-only tuples is transaction aborts. Aborted XID xmin
fields are never frozen.

> > That would be a form of corruption, which is something that
> > would probably be detected by noticing orphaned heap-only tuples
> > (heap-only tuples not reachable from some root item on the same page,
> > or some other intermediary heap-only tuple reachable from a root
> > item).
>
> You're saying that there's no way that there's a tuple pointing to another
> tuple on the same page, which the pointed-to tuple belonging to a different
> HOT chain?

Define "belongs to a different HOT chain".

You can get orphaned heap-only tuples, obviously. But only due to
transaction abort. Any page with an orphaned heap-only tuple that is
not consistent with it being from an earlier abort is a corrupt heap
page.

> > > Can't we have a an update chain that is e.g.
> > > xmin 10, xmax 5 -> xmin 5, xmax invalid
> > >
> > > and a vacuum cutoff of 7? That'd preent the first tuple from being removed,
> > > but would allow 5 to be frozen.
> >
> > I don't see how that can be possible. That is contradictory, and
> > cannot possibly work, since it supposes a situation where every
> > possible MVCC snapshot sees the update that generated the
> > second/successor tuple as committed, while at the same time also
> > somehow needing the original tuple to stay in place. Surely both
> > things can never be true at the same time.
>
> The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
> that xid 10 is still running. All it means that one backend or slot has an
> xmin or xid of 7.

Of course that's true. But I wasn't talking about the general case --
I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid"
update chain case specifically, with its "skewered" OldestXmin of 7.

> s1: acquire xid 5
> s2: acquire xid 7
> s3: acquire xid 10
>
> s3: insert
> s3: commit
> s1: update
> s1: commit
>
> s2: get a new snapshot, xmin 7 (or just hold no snapshot)
>
> At this point the xmin horizon is 7. The first tuple's xmin can't be
> frozen. The second tuple's xmin can be.

Basically what I'm saying about OldestXmin is that it ought to "work
transitively", from the updater to the inserter that inserted the
now-updated tuple. That is, the OldestXmin should either count both
XIDs that appear in the update chain, or neither XID.

> > I believe you're right that an update chain that looks like this one
> > is possible. However, I don't think it's possible for
> > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
> > "skewers" the update chain like this, the value 7 from your example).
> > We ought to be able to rely on an OldestXmin value that can never let
> > such a situation emerge. Right?
>
> I don't see anything that'd guarantee that currently, nor do immediately see a
> possible way to get there.
>
> What do you think prevents such an OldestXmin?

ComputeXidHorizons() computes VACUUM's OldestXmin (actually it
computes h->data_oldest_nonremovable values) by scanning the proc
array. And counts PGPROC.xmin from each running xact. So ultimately
the inserter and updater are tied together by that. It's either an
OldestXmin that includes both, or one that includes neither.

Here are some facts that I think we both agree on already:

1. It is definitely possible to have an update chain like your "xmin
10, xmax 5 -> xmin 5, xmax invalid" example.

2. It is definitely not possible to "freeze xmax" by setting its value
to FrozenTransactionId or something similar -- there is simply no code
path that can do that, and never has been. (The term "freeze xmax" is
a bit ambiguous, though it usually means set xmax to
InvalidTransactionId.)

3. There is no specific reason to believe that there is a live bug here.

Putting all 3 together: doesn't it seem quite likely that the way that
we compute OldestXmin is the factor that prevents "skewering" of an
update chain? What else could possibly be preventing corruption here?
(Theoretically it might never have been discovered, but that seems
pretty hard to believe.)

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-11-10 01:46:07 Re: HOT chain validation in verify_heapam()
Previous Message Michael Paquier 2022-11-10 01:29:40 Re: Allow file inclusion in pg_hba and pg_ident files