Re: StandbyRecoverPreparedTransactions recovers subtrans links incorrectly

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: StandbyRecoverPreparedTransactions recovers subtrans links incorrectly
Date: 2017-04-23 10:05:44
Message-ID: CANP8+jLy1VdopsdCi3c0=msJCfu-oHAhgPJgtKJ84fFN_8EmOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23 April 2017 at 01:19, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2017-04-22 19:55:18 -0400, Tom Lane wrote:
>> Now that we've got consistent failure reports about the 009_twophase.pl
>> recovery test, I set out to find out why it's failing. It looks to me
>> like the reason is that this (twophase.c:2145):
>>
>> SubTransSetParent(xid, subxid, overwriteOK);
>>
>> ought to be this:
>>
>> SubTransSetParent(subxid, xid, overwriteOK);
>>
>> because the definition of SubTransSetParent is
>>
>> void
>> SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
>>
>> not the other way 'round.
>>
>> While "git blame" blames this line on the recent commit 728bd991c,
>> that just moved the call from somewhere else. AFAICS this has actually
>> been wrong since StandbyRecoverPreparedTransactions was written,
>> in 361bd1662 of 2010-04-13.
>
>> Also, when I fix that, it gets further but still crashes at the same
>> Assert in SubTransSetParent. The proximate cause this time seems to be
>> that RecoverPreparedTransactions's calculation of overwriteOK is wrong:
>> it's computing that as "false", but in reality the subtrans link in
>> question has already been set.
>>
>
> Yikes. This is clearly way undertested. It's also pretty scary that
> the code has recently been whacked out quite heavily (both 9.6 and
> master), without hitting anything around this - doesn't seem to bode
> well for how in-depth the testing was.

Obviously if there is a bug it's because tests didn't find it and
therefore it is by definition undertested for that specific bug.

I'm not sure what other conclusion you wish to draw, if any?

>> It's not clear to me how much potential this has to create user data
>> corruption, but it doesn't look good at first glance. Discuss.
>
> Hm. I think it can cause wrong tqual.c results in some edge cases.
> During HS, lastOverflowedXid will be set in some cases, and then
> XidInMVCCSnapshot etc will not find the actual toplevel xid, which'll in
> turn cause lookups snapshot->subxip (where all HS xids reside)
> potentially return wrong results.
>
> I was about to say that I don't see how it could result in persistent
> corruption however - the subtrans lookups are only done for
> (snapshot->xmin, snapshot->xmax] and subtrans is truncated
> regularly. But these days CHECKPOINT_END_OF_RECOVERY isn't obligatory
> anymore, so that might be delayed. Hm.

I've not found any reason, yet, to believe we return wrong answers in
any case, even though the transient data structure pg_subtrans is
corrupted by the switched call Tom discovers.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2017-04-23 10:10:33 Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Previous Message Simon Riggs 2017-04-23 09:59:32 Re: StandbyRecoverPreparedTransactions recovers subtrans links incorrectly