Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-06-03 20:53:46
Message-ID: CA+Tgmoa-pd_brvS8iX25Wm58TY37p+1iDTGseWiTFDX053xZNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>>> > that's actually ok; because the value at the beginning of the segment
>>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>>> > filled the page with zeros.
>>> >
>>> > I think that should be harmless, the worst that can happen is that
>>> > oldestOffset errorneously is 0, which should be correct, even though
>>> > possibly overly conservative, in these cases.
>>>
>>> Uh oh. That seems like a real bad problem for this approach. What
>>> keeps that from being the opposite of too conservative? There's no
>>> "safe" value in a circular numbering space.
>>
>> I think it *might* (I'm really jetlagged) be fine because that'll only
>> happen after a upgrade from < 9.3. And in that case we initialize
>> nextOffset to 0. That ought to safe us?
>
> That's pretty much betting the farm on the bugs we know about today
> being the only ones there are. That seems imprudent.

So here's a patch taking a different approach. In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid). Instead, we
attempt to cope with the unknown. That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point. This means
that if you're in this situation, you run without member wraparound
protection until it's corrected. A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach. As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk. It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
multixact-member-soft-fail.patch application/x-patch 16.7 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2015-06-03 22:05:44 Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Previous Message Alvaro Herrera 2015-06-03 19:04:47 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2015-06-03 21:17:09 Re: [PATCH] Document that directly callable functions may use fn_extra
Previous Message Andrew Dunstan 2015-06-03 20:22:38 Re: Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file