Re: Inadequate thought about buffer locking during hot standby replay

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-11 23:24:23
Message-ID: 8807.1352676263@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is a complete draft patch against HEAD for this issue. I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type. In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.) Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes. There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s). If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action. This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already. This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed. As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane

#application/octet-stream [restore-backup-blocks-individually-2.patch.gz] /home/tgl/pgsql/restore-backup-blocks-individually-2.patch.gz

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-11-11 23:25:57 Re: Inadequate thought about buffer locking during hot standby replay
Previous Message Jeff Davis 2012-11-11 22:52:20 Re: Enabling Checksums