Noah Misch <noah(at)leadboat(dot)com> writes:
> On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
>> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
>> like to get out of the way now before I let my knowledge of that code
>> get swapped out again. This is that SPGiST is unsafe for use by hot
>> standby slaves.
>> The problem comes from "redirect" tuples, which are short-lifespan
>> objects that replace a tuple that's been moved to another page.
>> A redirect tuple can be recycled as soon as no active indexscan could
>> be "in flight" from the parent index page to the moved tuple. SPGiST
>> implements this by marking each redirect tuple with the XID of the
>> creating transaction, and assuming that the tuple can be recycled once
>> that XID is below the OldestXmin horizon (implying that all active
>> transactions started after it ended). This is fine as far as
>> transactions on the master are concerned, but there is no guarantee that
>> the recycling WAL record couldn't be replayed on a hot standby slave
>> while there are still HS transactions that saw the old state of the
>> parent index tuple.
>> Now, btree has a very similar problem with deciding when it's safe to
>> recycle a deleted index page: it has to wait out transactions that could
>> be in flight to the page, and it does that by marking deleted pages with
> Attempting to write an explanation for that btree code led me conclude that
> the code is incorrect. (FWIW, I caused that wrongness.) I will start a
> separate thread to fix it.
After reviewing this thread and the one at
I believe that the SPGiST issue should be fixed as follows:
* It's okay for transactions inserting redirection tuples to use their
own XID as the marker.
* In spgvacuum.c, the test in vacuumRedirectAndPlaceholder to decide if
a redirection tuple is recyclable should use
rather than comparing against OldestXmin as it does now. (This allows
some consequent simplification since the module need not pass around
an OldestXmin parameter.)
* vacuumRedirectAndPlaceholder must also compute the newest XID among
the redirection tuples it removes from the page, and store that in
a new field of XLOG_SPGIST_VACUUM_REDIRECT WAL records.
* In spgRedoVacuumRedirect, call ResolveRecoveryConflictWithSnapshot
with the newest-redirect XID.
I had first thought that we could avoid a change in WAL contents by
having spgRedoVacuumRedirect compute the cutoff XID for itself by
examining the removed tuples, but that doesn't work: XLogInsert might
for instance have decided to store a full-page image, in which case the
information isn't available by inspection of the old page contents.
But we still have to enforce the interlock against hot standby xacts.
In principle we should bump the XLOG page magic number for this change
in WAL contents, but I'm inclined not to because it'd cause pain for
beta testers, and there are probably very few people who'd have live
XLOG_SPGIST_VACUUM_REDIRECT records in play when they switch to beta3.
regards, tom lane
In response to
pgsql-hackers by date
|Next:||From: Marko Kreen||Date: 2012-08-02 20:24:42|
|Subject: Re: [patch] libpq one-row-at-a-time API|
|Previous:||From: Peter Geoghegan||Date: 2012-08-02 18:04:51|
|Subject: Re: Help me develop new commit_delay advice|