Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Date: 2021-05-16 20:23:02
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> What I do have that's new is that *I can reproduce it*, at long last.

I see what's happening, but I'm not quite sure where we should fix it.

> I do not have a lot of idea why, but I see something that is
> probably a honkin' big clue:
> 2021-05-15 17:28:05.965 EDT [2469444] LOG: HeapTupleSatisfiesToast: xmin 2 t_infomask 0x0b02
> That is, the toast tuples in question are not just frozen, but
> actually have xmin = FrozenTransactionId.

Nah, that was a mistake: I was printing xmin using HeapTupleHeaderGetXmin
where I should have used HeapTupleHeaderGetRawXmin.

Anyway, the detailed sequence of events is that a background autoanalyze
replaces a pg_statistic entry, including outdating a toasted field, and
commits. The foreground pg_dump session has already fetched that entry,
and in the race condition in question, it doesn't try to read the toast
rows till after the commit. The reason it fails to find them is that
as part of the normal indexscan sequence for reading the toast table,
we execute heap_page_prune, and *it decides those rows are DEAD and
prunes them away* before we can fetch them.

The reason heap_page_prune thinks they're dead is that
GlobalVisTestIsRemovableXid told it so.

The reason GlobalVisTestIsRemovableXid thinks that is that
GlobalVisCatalogRels.maybe_needed is insane:

(gdb) p GlobalVisCatalogRels
$2 = {definitely_needed = {value = 16614}, maybe_needed = {
value = 18446744071709568230}}

The reason GlobalVisCatalogRels.maybe_needed is insane is that
(per pg_controldata) our latest checkpoint has

Latest checkpoint's NextXID: 0:16614
Latest checkpoint's oldestXID: 2294983910

so when GetSnapshotData computes

oldestfxid = FullXidRelativeTo(latest_completed, oldestxid);

it gets an insane value:

oldestfxid 18446744071709568230, latest_completed 16613, oldestxid 2294983910

And the reason oldestXID contains that is that pg_upgrade applied
pg_resetwal, which does this:

* For the moment, just set oldestXid to a value that will force
* immediate autovacuum-for-wraparound. It's not clear whether adding
* user control of this is useful, so let's just do something that's
* reasonably safe. The magic constant here corresponds to the
* maximum allowed value of autovacuum_freeze_max_age.
ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;

So it seems like we should do some combination of these things:

1. Fix FullXidRelativeTo to be a little less trusting. It'd
probably be sane to make it return FirstNormalTransactionId
when it'd otherwise produce a wrapped-around FullXid, but is
there any situation where we'd want it to throw an error instead?

2. Change pg_resetwal to not do the above. It's not entirely
apparent to me what business it has trying to force
autovacuum-for-wraparound anyway, but if it does need to do that,
can we devise a less klugy method?

It also seems like some assertions in procarray.c would be a
good idea. With the attached patch, we get through core
regression just fine, but the pg_upgrade test fails immediately
after the "Resetting WAL archives" step.

BTW, it looks like the failing code all came in with dc7420c2c
(snapshot scalability), so this fails to explain the similar-looking
field reports we've seen. But it's clearly a live bug in v14.
I shall go add it to the open items.

regards, tom lane

Attachment Content-Type Size
visibility-assertions.patch text/x-diff 2.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-05-16 20:34:06 Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Previous Message Nitin Jadhav 2021-05-16 18:22:41 Re: Query about time zone patterns in to_char