Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)bowt(dot)ie>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-11-08 17:28:52
Message-ID: 20211108172852.pvsarxoxi54uatch@localhost
Lists: pgsql-bugs

> On Sun, Nov 07, 2021 at 09:00:00PM +0300, Alexander Lakhin wrote:
> 31.10.2021 22:20, Dmitry Dolgov wrote:
> >>
> >> I suspect this is the same bug as #17245. Could you check if it's fixed by
> >> https://www.postgresql.org/message-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ%40mail.gmail.com
> >>
> >> The crash is somewhere in pg_class, which is also manually VACUUMed by the
> >> test, which could trigger the issue we found in the other thread. The likely
> >> reason the loop in the repro is needed is that that'll push one of the indexes
> >> on pg_class over the 512kb/min_parallel_index_scan_size boundary to start
> >> using parallel vacuum.
> > I've applied both patches from Peter, the fix itself and
> > index-points-to-LP_UNUSED-item assertions. Now it doesn't crash on
> > pg_unreachable, but hits those extra assertions in the second patch:
> Yes, the committed fix for the bug #17245 doesn't help here.
> I've also noticed that the server crash is not the only possible
> outcome. You can also get unexpected errors like:
> ERROR:  relation "errtst_parent" already exists
> ERROR:  relation "tmp_idx1" already exists
> ERROR:  relation "errtst_child_plaindef" already exists
> or
> ERROR:  could not open relation with OID 1033921
> STATEMENT:  DROP TABLE errtst_parent;
> in the server.log (and no crash).

Interesting, I don't think I've observed those errors. In fact, after the
recent changes around the assertion logic and index_delete_check_htid
(I've compiled from 39a31056), I'm now getting a different kind of crash
with your scripts. This time heap_page_prune_execute stumbles upon a tuple
that is not heap-only while updating the now-unused line pointers:

#0 0x00007f0dfce072fb in raise () from /lib64/libc.so.6
#1 0x00007f0dfcdf0ef6 in abort () from /lib64/libc.so.6
#2 0x000055a66a87db05 in ExceptionalCondition (conditionName=0x55a66a914610 "HeapTupleHeaderIsHeapOnly(htup)", errorType=0x55a66a91419c "FailedAssertion", fileName=0x55a66a914190 "pruneheap.c", lineNumber=961) at assert.c:69
#3 0x000055a66a21b83a in heap_page_prune_execute (buffer=138, redirected=0x7fffac638994, nredirected=0, nowdead=0x7fffac638e20, ndead=12, nowunused=0x7fffac639066, nunused=1) at pruneheap.c:961
#4 0x000055a66a219d82 in heap_page_prune (relation=0x7f0dfdca0e20, buffer=138, vistest=0x55a66addf140 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=true, off_loc=0x0) at pruneheap.c:295

The relation in question is pg_class, and htup apparently has, for
example, HEAP_KEYS_UPDATED set in t_infomask2, but not HEAP_ONLY_TUPLE:

>>> p *htup
$1 = {
  t_choice = {
    t_heap = {
      t_xmin = 661929,
      t_xmax = 662015,
      t_field3 = {
        t_cid = 0,
        t_xvac = 0
      }
    },
    t_datum = {
      datum_len_ = 661929,
      datum_typmod = 662015,
      datum_typeid = 0
    }
  },
  t_ctid = {
    ip_blkid = {
      bi_hi = 0,
      bi_lo = 2004
    },
    ip_posid = 128
  },
  t_infomask2 = 8225,
  t_infomask = 1281,
  t_hoff = 32 ' ',
  t_bits = 0x7f0de7080ee7 "\377\377\377?"
}
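
For reference, the two infomask values from the dump can be decoded
against the flag bits in src/include/access/htup_details.h. Below is a
minimal sketch in Python. The constants are transcribed by hand from
that header, so treat them as an assumption and re-check them against
the tree you are debugging. The helper names (decode, natts) are made up
for illustration and are not part of PostgreSQL.

```python
# Flag bits transcribed by hand from src/include/access/htup_details.h.
# Assumption: they match the source tree in use; verify before relying on them.
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0008: "HEAP_HASOID_OLD",
    0x0010: "HEAP_XMAX_KEYSHR_LOCK",
    0x0020: "HEAP_COMBOCID",
    0x0040: "HEAP_XMAX_EXCL_LOCK",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
    0x4000: "HEAP_MOVED_OFF",
    0x8000: "HEAP_MOVED_IN",
}

INFOMASK2_FLAGS = {
    0x2000: "HEAP_KEYS_UPDATED",
    0x4000: "HEAP_HOT_UPDATED",
    0x8000: "HEAP_ONLY_TUPLE",
}

# The low 11 bits of t_infomask2 hold the number of attributes.
HEAP_NATTS_MASK = 0x07FF


def decode(value, flags):
    """Return the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(flags.items()) if value & bit]


def natts(infomask2):
    """Return the attribute count stored in t_infomask2."""
    return infomask2 & HEAP_NATTS_MASK


if __name__ == "__main__":
    # Values taken from the gdb dump of *htup.
    print("t_infomask2:", decode(8225, INFOMASK2_FLAGS), "natts =", natts(8225))
    print("t_infomask: ", decode(1281, INFOMASK_FLAGS))
```

With the values above, t_infomask2 (0x2021) has only HEAP_KEYS_UPDATED
set plus an attribute count of 33, and t_infomask (0x0501) has
HEAP_HASNULL, HEAP_XMIN_COMMITTED and HEAP_XMAX_COMMITTED set. Neither
HEAP_ONLY_TUPLE nor HEAP_HOT_UPDATED is present, which is consistent
with the failed assertion.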
