Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-06-02 15:16:22
Message-ID: CA+TgmoaqVPa7sAuLOSY_LPy43an0pzoKjUj+jfx5No_PPoWx9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
>> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> >> SetMultiXactIdLimit() bracketed certain parts of its
>> >> logic with if (!InRecovery), but those guards were ineffective because
>> >> it gets called before InRecovery is set in the first place.
>> >
>> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
>> > called at nearly the same moments as SetMultiXactIdLimit(). Do you have a
>> > sense of whether it is subject to similar problems as a result?
>>
>> Well, I think it's pretty weird that those things will get done before
>> beginning recovery, even on an inconsistent cluster, but not during
>> recovery. That is pretty strange. I don't see that it can actually
>> do any worse than emit a few log messages at the start of recovery
>> that won't show up again until the end of recovery, though.
>
> Granted. Would it be better to update both functions at the same time, and
> perhaps to make that a master-only change? Does the status quo cause more
> practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

It does in the case of the call to find_multixact_start(). If that
fails, we take the server down for no good reason, as demonstrated by
the original report. I'll revert the changes to the other two things
in this function that I changed to be protected by did_trim. There's
no special reason to think that's a necessary change.

>> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> >> until we're consistent. Also, instead of passing
>> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> >> the earliest offset that exists on disk. That way, it won't try to
>> >> read data that's not there.
>> >
>> > Perhaps emit a LOG message when we do that, since it's our last opportunity to
>> > point to the potential data loss?
>>
>> If the problem is just that somebody minmxid got set to 1 instead of
>> the real value, I think that there is no data loss, because none of
>> those older values are actually present there. But we could add a LOG
>> message anyway. How do you suggest that we phrase that?
>
> There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> to 1. Otherwise, data loss is possible.

Yes, but in that scenario, the log message you propose wouldn't be
triggered. If true_minmxid > 2^31, then the stored minmxid will not
precede the files on disk; it will follow it (assuming the older stuff
hasn't been truncated, as is likely). So the message would be
essentially:

LOG: you didn't lose data. but if exactly the opposite of what this
message is telling you about had happened, then you would have.
DETAIL: Have a nice day.

> I don't hope for an actionable
> message, but we might want a reporter to grep logs for it when we diagnose
> future reports. Perhaps this:
>
> "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

This seems misleading. In the known failure case, it's not that the
pg_multixact files are unexpectedly missing; it's that we incorrectly
think that they should still be there. Maybe:

oldest MultiXactId on disk %u follows expected oldest MultiXact %u

> For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> more than one segment:
>
> "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

So, this scenario will happen whenever the system was interrupted in
the middle of a truncation, or when the system is started from a base
backup and a truncation happened after files were copied. I'm wary of
giving users the idea that this is an atypical event. Perhaps a
message at DEBUG1?

>> I'm not sure what you mean about it becoming too old. At least with
>> that fix, it should get updated to exactly the first file that we
>> didn't remove. Isn't that right?
>
> Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
> only in that it never reads or writes the cache. I might expect
> oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
> does hold most of the time. It does not always hold between the start of the
> quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
> That's because raw_GOMXOD() changes at the instant we unlink the oldest
> segment, but oldestMultiXactOnDisk changes later. Any backend may call
> GetOldestMultiXactOnDisk() via SetMultiXactIdLimit(). If a backend does that
> concurrent with the checkpointer running TruncateMultiXact() and sees a stale
> oldestMultiXactOnDisk, is that harmless?

As far as I can tell, it's pretty much harmless. I mean, we've
already discussed the risk that the head and tail could get too far
apart, because really it should be TruncateMultiXact(), not
SetMultiXactIdLimit(), that establishes the new stop point. But that
problem exists independent of what you're talking about here.

>> I can see that there might be an issue there, but I can't quite put my
>> finger on it well enough to say that it definitely is an issue. This
>> code is examining the offsets space rather than the members space, and
>> the protections against offsets wraparound have been there since the
>> original commit of this feature
>> (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have
>> no concrete evidence that there's ever been a problem in this area.
>> It seems like it might be safer to rejigger that code so that it
>> considers distance-behind-current rather than using the wrapped
>> comparison logic, but I'm reluctant to start rejiggering more things
>> without knowing what specifically I'm fixing.
>
> Anything that could cause the pg_multixact/offsets tail to rotate from being
> in the past to being in the future poses this risk. (This is the tail from
> the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> notions of the tail play no part here.) I had in mind that the pg_upgrade
> datminmxid=1 bug could be a tool for achieving that, but I've been
> unsuccessful so far at building a credible thought experiment around it. Near
> the beginning of your reply, you surmised that this could happen between a
> VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
> Another vector is unlink() failure on a segment file. SlruDeleteSegment()
> currently ignores the unlink() return value; the only harm has been some disk
> usage. With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> mandatory to forestall the appearance of a wrapped state.

I'm having trouble figuring out what to do about this. I mean, the
essential principle of this patch is that if we can't count on
relminmxid, datminmxid, or the control file to be accurate, we can at
least look at what is present on the disk. If we also cannot count on
that to be accurate, we are left without any reliable source of
information. Consider a hypothetical cluster where all our stored
minmxids of whatever form are corrupted (say, all change to 1) and in
addition there are stray files in pg_multixact. I don't think there's
really any way to get ourselves out of trouble in that scenario.

There are some things that we could do that might help a little. For
example, if the control file's oldest-multi value points to a file
that is not on disk, we could advance it a file at a time until it
lands on a file that is present. This would help if the oldest-multi
value is pointing to the future, but everything on disk is in the
past.

Another idea is that we could rely on the "on disk" value only until
MultiXactState->lastCheckpointedOldest actually points to a real file.
After we get out of this kind of trouble, we should never get back
into it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andres Freund 2015-06-02 15:27:33 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Previous Message Dorian Hoxha 2015-06-02 14:58:29 Re: Database designpattern - product feature

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-06-02 15:16:59 Re: checkpointer continuous flushing
Previous Message Tom Lane 2015-06-02 15:15:27 Re: [PATCH] Add error handling to byteaout.