From: Tom Lane <tgl@sss.pgh.pa.us>
To: Andres Freund <andres@anarazel.de>
Cc: Peter Geoghegan <pg@bowt.ie>, pgsql-hackers@lists.postgresql.org
Subject: Re: PANIC: wrong buffer passed to visibilitymap_clear
Date: 2021-04-11 05:04:27
Message-ID: 2644575.1618117467@sss.pgh.pa.us
Lists: pgsql-hackers

I've managed to reproduce this locally, by dint of running the
src/bin/scripts tests over and over and tweaking the timing by
trying different "taskset" parameters to vary the number of CPUs
available. I find that I've duplicated the report from spurfowl;
in particular:

(gdb) bt
#0 0x00007f67bb6807d5 in raise () from /lib64/libc.so.6
#1 0x00007f67bb669895 in abort () from /lib64/libc.so.6
#2 0x000000000094ce37 in errfinish (filename=<optimized out>,
lineno=<optimized out>,
funcname=0x9ac120 <__func__.1> "visibilitymap_clear") at elog.c:680
#3 0x0000000000488b8c in visibilitymap_clear (rel=rel@entry=0x7f67b2837330,
heapBlk=<optimized out>, buf=buf@entry=0, flags=flags@entry=3 '\003')
^^^^^^^^^^^^^^^
at visibilitymap.c:155
#4 0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
crosscheck=<optimized out>, wait=<optimized out>, tmfd=0x7ffecf4d5700,
lockmode=0x7ffecf4d56fc) at heapam.c:3993
#5 0x000000000055dd61 in simple_heap_update (
relation=relation@entry=0x7f67b2837330, otid=otid@entry=0x7f67b274744c,
tup=tup@entry=0x7f67b2747448) at heapam.c:4211
#6 0x00000000005e531c in CatalogTupleUpdate (heapRel=0x7f67b2837330,
otid=0x7f67b274744c, tup=0x7f67b2747448) at indexing.c:309
#7 0x00000000006420f9 in update_attstats (relid=1255, inh=false,
natts=natts@entry=30, vacattrstats=vacattrstats@entry=0x19c9fc0)
at analyze.c:1758
#8 0x00000000006430dd in update_attstats (vacattrstats=0x19c9fc0, natts=30,
inh=false, relid=<optimized out>) at analyze.c:1646
#9 do_analyze_rel (onerel=<optimized out>, params=0x7ffecf4d5e50,
va_cols=0x0, acquirefunc=<optimized out>, relpages=86,
inh=<optimized out>, in_outer_xact=false, elevel=13) at analyze.c:589
#10 0x00000000006447a1 in analyze_rel (relid=<optimized out>,
relation=<optimized out>, params=params@entry=0x7ffecf4d5e50, va_cols=0x0,
in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:261
#11 0x00000000006a5718 in vacuum (relations=0x19c8158, params=0x7ffecf4d5e50,
bstrategy=<optimized out>, isTopLevel=<optimized out>) at vacuum.c:478
#12 0x00000000006a5c94 in ExecVacuum (pstate=pstate@entry=0x1915970,
vacstmt=vacstmt@entry=0x18ed5c8, isTopLevel=isTopLevel@entry=true)
at vacuum.c:254
#13 0x000000000083c32c in standard_ProcessUtility (pstmt=0x18ed918,
queryString=0x18eca20 "ANALYZE pg_catalog.pg_proc;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x18eda08, qc=0x7ffecf4d61c0) at utility.c:826

I'd not paid much attention to that point before, but now it
seems there is no question that heap_update is reaching line 3993

visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer, VISIBILITYMAP_VALID_BITS);

without having changed "vmbuffer" from its initial value of
InvalidBuffer. It looks that way at both frame 3 and frame 4:

(gdb) f 4
#4 0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
crosscheck=<optimized out>, wait=<optimized out>, tmfd=0x7ffecf4d5700,
lockmode=0x7ffecf4d56fc) at heapam.c:3993
3993 visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
(gdb) i locals
...
vmbuffer = 0
vmbuffer_new = 0
...
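
For reference, the guard that's tripping here (visibilitymap.c:155,
frame 3) looks like this, paraphrasing from memory rather than quoting
the source:

    /* sketch of the sanity check at the top of visibilitymap_clear() */
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);

    if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
        elog(ERROR, "wrong buffer passed to visibilitymap_clear");

With buf = InvalidBuffer (zero), BufferIsValid() fails immediately;
and since heap_update makes this call inside a critical section, the
ERROR is promoted to PANIC -- hence the abort() in frame 1.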

It is also hard to doubt that somebody broke this in the last-minute
commit blizzard. There are no reports of this PANIC in the buildfarm for
the last month, but we're now up to four (last I checked) since Thursday.

While the first thing that comes to mind is a logic bug in heap_update
itself, that code doesn't seem to have changed much in the last few days.
Moreover, why is it that this only seems to be happening within
do_analyze_rel -> update_attstats? (We only have two stack traces
positively showing that, but all the buildfarm reports look like the
failure is happening within manual or auto analyze of a system catalog.
Fishy as heck.)

Just eyeing the evidence on hand, I'm wondering if something has decided
it can start setting the page-all-visible bit without adequate lock,
perhaps only in system catalogs. heap_update is clearly assuming that
that flag won't change underneath it, and if it did, it's clear how this
symptom would ensue.
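
To make that hazard concrete: the shape of the relevant code in
heap_update is roughly as follows (paraphrased, not the verbatim
source; the real thing has a recheck-and-retry dance on top of this):

    Buffer      vmbuffer = InvalidBuffer;
    Page        page = BufferGetPage(buffer);

    /*
     * Pin the VM page before taking the buffer content lock, since
     * the pin may require I/O.  No lock is held yet, so the bit can
     * change under us; that's what the post-lock recheck is for.
     */
    if (PageIsAllVisible(page))
        visibilitymap_pin(relation, block, &vmbuffer);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /* ... recheck, visibility tests, possible unlock/relock ... */

    /* later, inside the WAL critical section: */
    if (PageIsAllVisible(page))
    {
        PageClearAllVisible(page);
        visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                            vmbuffer, VISIBILITYMAP_VALID_BITS);
    }

If PD_ALL_VISIBLE can become set by someone who doesn't hold the
buffer lock, after the last point where heap_update rechecks it, we
reach that final test with the bit set but vmbuffer still
InvalidBuffer -- exactly the state the backtrace shows.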

Too tired to take it further tonight ... discuss among yourselves.

regards, tom lane
