Re: Re: SPGiST versus hot standby - question about conflict resolution rules

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-04-19 07:12:16
Message-ID: CA+U5nMKVu2BRug_ORh1FGYEiQ5jKPfkUecCzUUp=VAwU0VTg8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 19, 2012 at 7:55 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
>> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
>> like to get out of the way now before I let my knowledge of that code
>> get swapped out again.  This is that SPGiST is unsafe for use by hot
>> standby slaves.
>
> I suspect that swap-out has passed, but ...
>
>> The problem comes from "redirect" tuples, which are short-lifespan
>> objects that replace a tuple that's been moved to another page.
>> A redirect tuple can be recycled as soon as no active indexscan could
>> be "in flight" from the parent index page to the moved tuple.  SPGiST
>> implements this by marking each redirect tuple with the XID of the
>> creating transaction, and assuming that the tuple can be recycled once
>> that XID is below the OldestXmin horizon (implying that all active
>> transactions started after it ended).  This is fine as far as
>> transactions on the master are concerned, but there is no guarantee that
>> the recycling WAL record couldn't be replayed on a hot standby slave
>> while there are still HS transactions that saw the old state of the
>> parent index tuple.
>>
>> Now, btree has a very similar problem with deciding when it's safe to
>> recycle a deleted index page: it has to wait out transactions that could
>> be in flight to the page, and it does that by marking deleted pages with
>> XIDs.  I see that the problem has been patched for btree by emitting a
>> special WAL record just before a page is recycled.  However, I'm a bit
>> nervous about copying that solution, because the details are a bit
>> different.  In particular, I see that btree marks deleted pages with
>> ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
>> rather than the XID of the originating transaction, and then it
>> subtracts one from the XID before sending it to the WAL stream.
>> The comments about this are not clear enough for me, and so I'm
>
> Attempting to write an explanation for that btree code led me conclude that
> the code is incorrect.  (FWIW, I caused that wrongness.)  I will start a
> separate thread to fix it.

Wrong or not, we need to better document why we picked
ReadNewTransactionID(), rather than OldestXmin which seems the more
obvious and cheaper choice.

> All hot standby transactions holding snapshots taken before the startup
> process applies the tuple-mover transaction's commit record will have xmin <=
> its XID.  Therefore, passing that XID to ResolveRecoveryConflictWithSnapshot()
> meets the need here precisely.

Yes, agreed. i.e. don't subtract 1.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Susanne Ebrecht 2012-04-19 07:34:22 Re: Bug tracker tool we need
Previous Message Jeff Davis 2012-04-19 07:10:04 Re: 9.3 Pre-proposal: Range Merge Join