On Thu, Apr 19, 2012 at 7:55 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
>> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
>> like to get out of the way now before I let my knowledge of that code
>> get swapped out again. This is that SPGiST is unsafe for use by hot
>> standby slaves.
> I suspect that swap-out has passed, but ...
>> The problem comes from "redirect" tuples, which are short-lifespan
>> objects that replace a tuple that's been moved to another page.
>> A redirect tuple can be recycled as soon as no active indexscan could
>> be "in flight" from the parent index page to the moved tuple. SPGiST
>> implements this by marking each redirect tuple with the XID of the
>> creating transaction, and assuming that the tuple can be recycled once
>> that XID is below the OldestXmin horizon (implying that all active
>> transactions started after it ended). This is fine as far as
>> transactions on the master are concerned, but there is no guarantee that
>> the recycling WAL record couldn't be replayed on a hot standby slave
>> while there are still HS transactions that saw the old state of the
>> parent index tuple.
>> Now, btree has a very similar problem with deciding when it's safe to
>> recycle a deleted index page: it has to wait out transactions that could
>> be in flight to the page, and it does that by marking deleted pages with
>> XIDs. I see that the problem has been patched for btree by emitting a
>> special WAL record just before a page is recycled. However, I'm a bit
>> nervous about copying that solution, because the details are a bit
>> different. In particular, I see that btree marks deleted pages with
>> ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
>> rather than the XID of the originating transaction, and then it
>> subtracts one from the XID before sending it to the WAL stream.
>> The comments about this are not clear enough for me, and so I'm
> Attempting to write an explanation for that btree code led me conclude that
> the code is incorrect. (FWIW, I caused that wrongness.) I will start a
> separate thread to fix it.
Wrong or not, we need to better document why we picked
ReadNewTransactionID(), rather than OldestXmin which seems the more
obvious and cheaper choice.
> All hot standby transactions holding snapshots taken before the startup
> process applies the tuple-mover transaction's commit record will have xmin <=
> its XID. Therefore, passing that XID to ResolveRecoveryConflictWithSnapshot()
> meets the need here precisely.
Yes, agreed. i.e. don't subtract 1.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
In response to
pgsql-hackers by date
|Next:||From: Susanne Ebrecht||Date: 2012-04-19 07:34:22|
|Subject: Re: Bug tracker tool we need|
|Previous:||From: Jeff Davis||Date: 2012-04-19 07:10:04|
|Subject: Re: 9.3 Pre-proposal: Range Merge Join|