Re: recovering from "found xmin ... from before relfrozenxid ..."

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Date: 2020-07-14 19:31:26
Message-ID: 20200714193126.2xxtinqnm4hogzjt@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-07-13 21:18:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > What if clog has been truncated so that the xmin can't be looked up?
> >
> > That's possible, but probably only in cases where xmin actually
> > committed.
>
> Isn't that the normal case? I'm imagining something like:
>
> - Tuple gets inserted. Transaction commits.
> - VACUUM processes table.
> - Mischievous fairies mark page all-visible in the visibility map.
> - VACUUM runs lots more times, relfrozenxid advances, but without ever
> looking at the page in question, because it's all-visible.
> - clog is truncated, rendering xmin no longer accessible.
> - User runs VACUUM disabling page skipping, gets ERROR.
> - User deletes offending tuple.
> - At this point, I think the tuple is both invisible and unprunable?
> - Fairies happy, user sad.

I'm not saying it's impossible that that happens, but the cases I did
investigate didn't look like this. If something just roguely wrote to
the VM I'd expect a lot more "is not marked all-visible but visibility
map bit is set in relation" type WARNINGs, and I've not seen much of
those (they're WARNINGs though, so maybe we wouldn't). Presumably this
wouldn't always just happen with tuples that'd trigger an error first
during hot pruning.

I've definitely seen indications of both datfrozenxid and relfrozenxid
getting corrupted (in particular vac_update_datfrozenxid being racy as
hell), xid wraparound, indications of multixact problems (although it's
possible we've now fixed those) and some signs of corrupted relcache
entries for shared relations leading to vacuums being skipped.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-07-14 19:36:55 Re: recovering from "found xmin ... from before relfrozenxid ..."
Previous Message Andres Freund 2020-07-14 19:18:01 Re: [HACKERS] PATCH: Batch/pipelining support for libpq