Re: recovering from "found xmin ... from before relfrozenxid ..."

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, MBeena Emerson <mbeena(dot)emerson(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Date: 2020-09-21 03:31:48
Message-ID: CAA4eK1KsVpDQPMj5hB5tT9=Pu8-9OG2EOcLHb+aeEg8YSV1YjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum. I don't have time to poke further right now.
>
..
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway? I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules. (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear). So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables. We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits. However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist. I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes. This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>

Yeah, on a quick look it seems before commit dc7420c2c9 the
pruneheap.c and the main Vacuum code use to make the same decision and
that is commit which has introduced GlobalVisState stuff.

> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>

What if ensure that it runs with autovacuum = off and there is no
parallel test running? I am not sure about the second part but if we
can do that then the test will be probably stable.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2020-09-21 03:41:09 Re: Dynamic gathering the values for seq_page_cost/xxx_cost
Previous Message Ashutosh Sharma 2020-09-21 02:39:20 Re: recovering from "found xmin ... from before relfrozenxid ..."