Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-06-01 18:22:32
Message-ID: CA+TgmobBGs2G_y74khFdZqc5GhtONMnFbVkPXunPtZPsB4ttRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I'm probably biased here, but I think we should finish reviewing,
>> testing, and committing my patch before we embark on designing this.
>
> Probably, yes. I am wondering whether doing this immediately won't end
> up making some things simpler and more robust though.

I'm open to being convinced of that, but as of this moment I'm not
seeing any clear-cut evidence that we need to go so far.

>> So far we have no reports of trouble attributable to the lack of the
>> WAL-logging support discussed here, as opposed to several reports of
>> trouble from the status quo within days of release.
>
> The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> era, where we didn't do any truncation during recovery...

Right, but now we're piggybacking on the checkpoint records, and I
don't have any evidence that this approach can't be made robust. It's
possible that it can't be made robust, but that's not currently clear.

>> By the time we've reached the minimum recovery point, they will have
>> been recreated by the same WAL records that created them in the first
>> place.
>
> I'm not sure that's true. I think we could end up errorneously removing
> files that were included in the base backup. Anyway, let's focus on your
> patch for now.

OK, but I am interested in discussing the other thing too. I just
can't piece together the scenario myself - there may well be one. The
base backup will begin replay from the checkpoint caused by
pg_start_backup() and remove anything that wasn't there at the start
of the backup. But all of that stuff should get recreated by the time
we reach the minimum recovery point (end of backup).

>> If, in the previous
>> replay, we had wrapped all the way around, some of the stuff we keep
>> may actually already have been overwritten by future WAL records, but
>> they'll be overwritten again. Now, that could mess up our
>> determination of which members to remove, I guess, but I'm not clear
>> that actually matters either: if the offsets space has wrapped around,
>> the members space will certainly have wrapped around as well, so we
>> can remove anything we like at this stage and we're still OK. I agree
>> this is ugly the way it is, but where is the actual bug?
>
> I'm more worried about the cases where we didn't ever actually "badly
> wrap around" (i.e. overwrite needed data); but where that's not clear on
> the standby because the base backup isn't in a consistent state.

I agree. The current patch tries to make it so that we never call
find_multixact_start() while in recovery, but it doesn't quite
succeed: the call in TruncateMultiXact still happens during recovery,
but only once we're sure that the mxact we plan to call it on actually
exists on disk. That won't be called until we replay the first
checkpoint, but that might still be prior to consistency.

Since I forgot to attach the revised patch with fixes for the points
Noah mentioned to that email, here it is attached to this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
inconsistent-multixact-fix-master-v2.patch application/x-patch 13.1 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Dorian Hoxha 2015-06-01 18:29:38 Re: Database designpattern - product feature
Previous Message Guillaume Lelarge 2015-06-01 18:13:42 Re: Is my standby fully connected?

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2015-06-01 18:39:07 Re: Really bad blowups with hash outer join and nulls
Previous Message Josh Berkus 2015-06-01 18:09:31 Re: [CORE] postpone next week's release