Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Date: 2019-05-01 01:36:38
Message-ID: 20190501013638.7u7fwr4zktbyyng5@alap3.anarazel.de
Lists: pgsql-hackers

On 2019-04-30 15:53:07 -0700, Andres Freund wrote:
> Hi,
>
> On 2019-04-30 18:42:36 -0400, Tom Lane wrote:
> > markhor just reported in with results showing that we have worse
> > problems than deadlock-prone tests in the back branches: 9.4
> > for example looks like
>
> > -- whole tables
> > REINDEX TABLE pg_class; -- mapped, non-shared, critical
> > + ERROR: could not read block 0 in file "base/16384/27769": read only 0 of 8192 bytes
>
> Ugh. Also failed on 9.6.

I see the bug. It turns out we need another way to avoid the assertion
failure triggered by doing catalog updates within
RelationSetNewRelfilenode() - we can't just move the
SetReindexProcessing() call before it. When CCA is enabled, the
CommandCounterIncrement() near the tail of RelationSetNewRelfilenode()
triggers a rebuild of the catalog entries - but without
SetReindexProcessing() in effect, those scans will try to use the index
currently being rebuilt, which then predictably fails:

#0 mdread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7f71037db800 "") at /home/andres/src/postgresql/src/backend/storage/smgr/md.c:633
#1 0x00005600ae3f656f in smgrread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7f71037db800 "")
at /home/andres/src/postgresql/src/backend/storage/smgr/smgr.c:590
#2 0x00005600ae3b4c13 in ReadBuffer_common (smgr=0x5600aea36498, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0,
hit=0x7fff5bb11cab) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:896
#3 0x00005600ae3b44ab in ReadBufferExtended (reln=0x7f7107972540, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0)
at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:664
#4 0x00005600ae3b437f in ReadBuffer (reln=0x7f7107972540, blockNum=0) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:596
#5 0x00005600ae00e0b3 in _bt_getbuf (rel=0x7f7107972540, blkno=0, access=1) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:805
#6 0x00005600ae00dd2a in _bt_heapkeyspace (rel=0x7f7107972540) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:694
#7 0x00005600ae01679c in _bt_first (scan=0x5600aea44440, dir=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtsearch.c:1237
#8 0x00005600ae012617 in btgettuple (scan=0x5600aea44440, dir=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtree.c:247
#9 0x00005600ae005572 in index_getnext_tid (scan=0x5600aea44440, direction=ForwardScanDirection)
at /home/andres/src/postgresql/src/backend/access/index/indexam.c:550
#10 0x00005600ae00571e in index_getnext_slot (scan=0x5600aea44440, direction=ForwardScanDirection, slot=0x5600ae9c6ed0)
at /home/andres/src/postgresql/src/backend/access/index/indexam.c:642
#11 0x00005600ae003e54 in systable_getnext (sysscan=0x5600aea44080) at /home/andres/src/postgresql/src/backend/access/index/genam.c:450
#12 0x00005600ae564292 in ScanPgRelation (targetRelId=1259, indexOK=true, force_non_historic=false)
at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:365
#13 0x00005600ae568203 in RelationReloadNailed (relation=0x5600aea0c4d0) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2292
#14 0x00005600ae568621 in RelationClearRelation (relation=0x5600aea0c4d0, rebuild=true) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#15 0x00005600ae569081 in RelationCacheInvalidate () at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2858
#16 0x00005600ae55b32b in InvalidateSystemCaches () at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:649
#17 0x00005600ae55b408 in AcceptInvalidationMessages () at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:708
#18 0x00005600ae3d7b22 in LockRelationOid (relid=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/storage/lmgr/lmgr.c:136
#19 0x00005600adf85ad2 in relation_open (relationId=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/access/common/relation.c:56
#20 0x00005600ae040337 in table_open (relationId=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/access/table/table.c:43
#21 0x00005600ae564215 in ScanPgRelation (targetRelId=2662, indexOK=false, force_non_historic=false)
at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:348
#22 0x00005600ae567ecf in RelationReloadIndexInfo (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2170
#23 0x00005600ae5681d3 in RelationReloadNailed (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2270
#24 0x00005600ae568621 in RelationClearRelation (relation=0x7f7107972540, rebuild=true) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#25 0x00005600ae568d19 in RelationFlushRelation (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2686
#26 0x00005600ae568e32 in RelationCacheInvalidateEntry (relationId=2662) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2738
#27 0x00005600ae55b1af in LocalExecuteInvalidationMessage (msg=0x5600aea262a8) at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:589
#28 0x00005600ae55af06 in ProcessInvalidationMessages (hdr=0x5600aea26250, func=0x5600ae55b0a8 <LocalExecuteInvalidationMessage>)
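
For context, the relevant ordering in reindex_index() is currently roughly
the following (a simplified sketch, not the literal source):

/* Does a CommandCounterIncrement() near its tail. */
RelationSetNewRelfilenode(iRel, persistence);

/*
 * Under CLOBBER_CACHE_ALWAYS that CCI forces relcache rebuilds, whose
 * pg_class scans go through the index that just got a new, still-empty
 * relfilenode (pg_class_oid_index in the trace above), because
 * SetReindexProcessing() hasn't been reached yet.
 */
SetReindexProcessing(heapId, indexId);

/* ... index_build() etc. ... */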

I can think of several ways to properly fix this:

1) Remove the CommandCounterIncrement() from
RelationSetNewRelfilenode() and move it to the callers. That would, I
think, allow for proper sequencing in reindex_index():

/*
 * Create a new relfilenode - note that this doesn't make the new
 * relfilenode visible yet; we'd otherwise run into the danger of that
 * index (which is empty at this point) being used while processing
 * cache invalidations.
 */
RelationSetNewRelfilenode(iRel, persistence);

/*
 * Before making the new relfilenode visible, prevent use of the
 * to-be-reindexed index while it is being rebuilt.
 */
SetReindexProcessing(heapId, indexId);

CommandCounterIncrement();

2) Separate the state driving the assertion triggered by
SetReindexProcessing() from the state that prohibits use of the index
for searches.
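
A rough sketch of what (2) could look like - the function names below are
invented, purely to illustrate splitting the state:

/* Hypothetical: only make systable scans avoid the index; the
 * no-catalog-insert assertion is not armed yet. */
SetReindexSuppressScans(heapId, indexId);

/*
 * The catalog updates (and the CCI) in here can then no longer trip the
 * assertion, while CCA-induced relcache rebuilds still won't scan the
 * empty index.
 */
RelationSetNewRelfilenode(iRel, persistence);

/* Hypothetical: now also assert against catalog inserts into the index. */
SetReindexForbidCatalogInserts(indexId);

/* ... index_build() etc. ... */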

3) Turn on REINDEX_REL_SUPPRESS_INDEX_USE mode when reindexing
pg_class. But that seems like a bigger hammer than necessary?
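
(For reference, that flag makes reindex_relation() mark all of the table's
indexes as reindex-pending up front, so systable scans fall back to heap
scans - roughly the effect of something like the sketch below, assuming the
reindex_relation(relid, flags, options) signature; it'd penalize every
pg_class lookup for the duration.)

/* Sketch only: suppress use of all of pg_class's indexes while they're
 * being rebuilt, forcing seqscans of pg_class in the meantime. */
reindex_relation(RelationRelationId, REINDEX_REL_SUPPRESS_INDEX_USE, 0);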

Sidenote: It'd be pretty helpful to have an option for the buildfarm etc.
to turn md.c-type errors like this into PANICs.
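
Roughly something like the following in md.c's short-read path, gated by a
made-up developer GUC (panic_on_io_error doesn't exist, it's just to
illustrate the idea):

/* Hypothetical: elevate short reads to PANIC so the buildfarm leaves a
 * core dump instead of just logging an ERROR. */
if (nbytes != BLCKSZ)
    ereport(panic_on_io_error ? PANIC : ERROR,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
                    blocknum, relpath(reln->smgr_rnode, forknum),
                    nbytes, BLCKSZ)));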

> > Given this, I'm rethinking my position that we can dispense with these
> > test cases. Let's try putting them in a standalone test script, and
> > see whether that leads to failures or not. Even if it does, we'd
> > better keep them until we've got a fully clean bill of health from
> > the buildfarm.
>
> Yea. Seems likely this indicates a proper, distinct, bug :/
>
> I'll move the test into a new "reindex_catalog" test, with a comment
> explaining that the failure cases necessitating that are somewhere
> between bugs, ugly warts, and hard-to-fix edge cases.

Just pushed that.

Greetings,

Andres Freund
