Re: Bug in new buffer freelist code

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bug in new buffer freelist code
Date: 2003-12-23 20:41:49
Message-ID: 3FE8A88D.10309@Yahoo.com
Lists: pgsql-hackers

Tom Lane wrote:
> I just had the parallel regression tests hang up due to what appears to
> be a bug in the new ARC code. The CLUSTER test gets into an infinite
> loop trying to do "CLUSTER clstr_1;". The loop is in
> StrategyInvalidateBuffer's check that the buffer is already in the
> freelist; it isn't, and the freelist is circular.

It seems to me that buffers thrown away via StrategyInvalidateBuffer()
do not get their relnode and blocknum cleared. As a result,
FlushRelationBuffers(), which does a full scan of the whole buffer pool,
again finds buffers that once contained the block.
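
To illustrate the stale-tag effect, here is a minimal sketch in C. It is a toy
model, not the bufmgr.c code: the pool size, the ModelTag struct, INVALID_REL
and both function names are made up for the example. It only shows that a full
scan keeps matching a given-up buffer unless its tag is wiped.

```c
#include <assert.h>

#define POOL_SIZE   4   /* hypothetical, tiny pool for illustration */
#define INVALID_REL 0   /* hypothetical "no relation" marker */

typedef struct
{
    unsigned rel_node;   /* stands in for tag.rnode.relNode */
    unsigned block_num;  /* stands in for tag.blockNum */
} ModelTag;

static ModelTag pool[POOL_SIZE];

/* Count buffers whose tag still names the relation, as a full pool scan would. */
static int count_matching(unsigned rel_node)
{
    int n = 0;

    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].rel_node == rel_node)
            n++;
    return n;
}

/*
 * Give up buffer i. With clear_tag == 0 the old tag stays in place (the
 * suspected behaviour); with clear_tag != 0 the tag is wiped so that later
 * scans no longer match this buffer.
 */
static void model_invalidate(int i, int clear_tag)
{
    assert(i >= 0 && i < POOL_SIZE);
    if (clear_tag)
    {
        pool[i].rel_node = INVALID_REL;
        pool[i].block_num = 0;
    }
}
```

With two buffers tagged as block 0 of rel 143906, giving one up without
clearing leaves count_matching(143906) at 2; giving it up with clearing
drops it to 1.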

If buffer 839 once contained that block and was given up that way, and
later on buffer 850 contains it, there is a CDB for it. If
FlushRelationBuffers() now scans the buffer pool, it will find buffer 839
first and call StrategyInvalidateBuffer() for it. That finds the CDB for
buffer 850 and adds buffer 839 to the list again. Later on,
FlushRelationBuffers() calls StrategyInvalidateBuffer() for buffer 850,
and we have the situation at hand.
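
The sequence above can be sketched with a toy freelist in C. This is a model
under stated assumptions, not the freelist.c code: it assumes buffers are
pushed at the head of a singly linked list, which would be consistent with the
dump below (listFreeBuffers = 839 and bufNext = 839 on buffer 839). NBUFFERS,
buf_next, free_head and the function names are hypothetical. The search is
bounded so that it reports a circular list instead of spinning forever.

```c
#include <assert.h>

#define NBUFFERS 1024   /* hypothetical pool size */
#define LIST_END (-1)   /* end-of-list marker, like bufNext = -1 in the dump */

static int buf_next[NBUFFERS];    /* stands in for each buffer's bufNext */
static int free_head = LIST_END;  /* stands in for listFreeBuffers */

/* Reset the model: no buffer is on the freelist. */
static void freelist_init(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        buf_next[i] = LIST_END;
    free_head = LIST_END;
}

/* Head insertion with no "is it already on the list?" check. */
static void freelist_push(int buf_id)
{
    assert(buf_id >= 0 && buf_id < NBUFFERS);
    buf_next[buf_id] = free_head;
    free_head = buf_id;
}

/*
 * Walk the freelist looking for buf_id.
 * Returns 1 if found, 0 if the list ends without it, and -1 if more than
 * NBUFFERS links were followed, which can only happen on a circular list.
 */
static int freelist_find(int buf_id)
{
    int steps = 0;

    for (int cur = free_head; cur != LIST_END; cur = buf_next[cur])
    {
        if (cur == buf_id)
            return 1;
        if (++steps > NBUFFERS)
            return -1;
    }
    return 0;
}
```

Pushing 839 once gives a normal one-element list, and freelist_find(850)
returns 0. Pushing 839 a second time sets buf_next[839] to 839, so the same
search for 850 never reaches the end of the list; an unbounded version of the
loop would hang the way the backtrace shows.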

Does that make sense?

Jan

>
> (gdb) bt
> #0 0x1fe8a8 in StrategyInvalidateBuffer (buf=0xc3a56f60) at freelist.c:733
> #1 0x1fbf08 in FlushRelationBuffers (rel=0x400fa298, firstDelBlock=0)
> at bufmgr.c:1596
> #2 0x1479fc in swap_relfilenodes (r1=143786, r2=143915) at cluster.c:736
> #3 0x147458 in rebuild_relation (OldHeap=0x2322b, indexOid=143788)
> at cluster.c:455
> #4 0x1473b0 in cluster_rel (rvtc=0x7b03bed8, recheck=0 '\000')
> at cluster.c:395
> #5 0x146ff4 in cluster (stmt=0x400b88a8) at cluster.c:232
> #6 0x21c60c in ProcessUtility (parsetree=0x400b88a8, dest=0x400b88e8,
> completionTag=0x7b03bbe8 "") at utility.c:1033
> ... etc ...
>
> (gdb) p *buf
> $5 = {bufNext = -1, data = 7211904, tag = {rnode = {tblNode = 17142,
> relNode = 143906}, blockNum = 0}, buf_id = 850, flags = 14,
> refcount = 0, io_in_progress_lock = 1721, cntx_lock = 1722,
> cntxDirty = 0 '\000', wait_backend_id = 0}
> (gdb) p *StrategyControl
> $1 = {target_T1_size = 423, listUnusedCDB = 249, listHead = {464, 967, 1692,
> 1227}, listTail = {968, 645, 1528, 1694}, listSize = {364, 413, 584, 636},
> listFreeBuffers = 839, num_lookup = 546939, num_hit = {1378, 246896, 282639,
> 3935}, stat_report = 0, cdb = {{prev = 386, next = 23, list = 3,
> buf_tag = {rnode = {tblNode = 17142, relNode = 19080}, blockNum = 30},
> buf_id = -1, t1_xid = 3402}}}
> (gdb) p BufferDescriptors[839]
> $2 = {bufNext = 839, data = 7121792, tag = {rnode = {tblNode = 17142,
> relNode = 143906}, blockNum = 0}, buf_id = 839, flags = 14,
> refcount = 0, io_in_progress_lock = 1699, cntx_lock = 1700,
> cntxDirty = 0 '\000', wait_backend_id = 0}
>
> So we've got a couple of problems here: buffers 839 and 850 both claim
> to contain block 0 of rel 143906 (which is clstr_1), and the freelist
> is circular.
>
> This doesn't seem to be super reproducible, but there's definitely a
> problem in there somewhere.
>
> regards, tom lane

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #
