From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> After my commit c532d15ddd to split up copy.c, buildfarm animal "prion"
> failed in pg_upgrade

prion's continued to fail, rarely, in this same way.
The failures are remarkably identical, and they also look a lot like
field reports we've been seeing off and on for years. I do not know
why it always seems to be pg_toast_2619 (i.e. pg_statistic) that's
affected, but the pattern is pretty undeniable by now.
What I do have that's new is that *I can reproduce it*, at long last.
For me, installing the attached patch and running pg_upgrade's
"make check" fails, pretty much every time, with symptoms identical
to prion's.
The patch consists of
(1) 100ms delay just before detoasting, when loading a pg_statistic
catcache entry that has toasted datums
(2) provisions to mark such catcache entries dead immediately
(effectively, CATCACHE_FORCE_RELEASE for only these tuples);
this is to force us through (1) as often as possible
(3) some quick-hack debugging aids added to HeapTupleSatisfiesToast,
plus convert the missing-chunk error to PANIC to get a stack trace
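
(The patch itself is in the attachment, not reproduced here. Purely as a
sketch of what item (1) amounts to -- the identifiers below are my guesses
at the relevant catcache code path, not the actual diff:)

```c
/* Hypothetical sketch only, not the attached patch.  In catcache.c's
 * tuple-construction path, sleep just before detoasting whenever the
 * entry being built is a pg_statistic row with out-of-line datums,
 * widening the window for concurrent vacuuming of the toast table.
 */
if (HeapTupleHasExternal(ntp))
{
    if (cache->cc_reloid == StatisticRelationId)
        pg_usleep(100 * 1000L);   /* 100ms delay, per item (1) */
    dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
}
```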
If it doesn't reproduce for you, try adjusting the delay. 100ms
was the first value I tried, though, so I think it's probably
not too sensitive.
The trace I'm getting shows pretty positively that autovacuum
has fired on pg_statistic, and removed the needed toast entries,
just before the failure. So something is wrong with our logic
about when toast entries can be removed.
I do not have a lot of idea why, but I see something that is
probably a honkin' big clue:
2021-05-15 17:28:05.965 EDT  LOG: HeapTupleSatisfiesToast: xmin 2 t_infomask 0x0b02
That is, the toast tuples in question are not just frozen, but
actually have xmin = FrozenTransactionId.
I do not think that is normal --- at least, it's not the state
immediately after initdb, and I can't make it happen with
"vacuum freeze pg_statistic". A plausible theory is that pg_upgrade
caused this to happen (but how?) and then vacuuming of toast rows
goes off the rails because of it.
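
(For anyone wanting to check the raw xmin state by hand: FrozenTransactionId
is 2, and since 9.4 ordinary freezing leaves the original xmin in place and
sets the HEAP_XMIN_FROZEN infomask bits instead -- which is why a literal 2
here looks abnormal. The toast chunks can be inspected directly, as
superuser, with something like:)

```sql
-- Inspect the pg_statistic toast chunks' raw xmin (superuser only).
SELECT xmin, chunk_id, chunk_seq
FROM pg_toast.pg_toast_2619
ORDER BY chunk_id, chunk_seq;
```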
Anyway, I have no more time to poke at this right now, so I'm
posting the reproducer in case someone else wants to look at it.
regards, tom lane