Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-06-02 05:21:21
Message-ID: 20150602052121.GA50317@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
> >> SetMultiXactIdLimit() bracketed certain parts of its
> >> logic with if (!InRecovery), but those guards were ineffective because
> >> it gets called before InRecovery is set in the first place.
> >
> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
> > called at nearly the same moments as SetMultiXactIdLimit(). Do you have a
> > sense of whether it is subject to similar problems as a result?
>
> Well, I think it's pretty weird that those things will get done before
> beginning recovery, even on an inconsistent cluster, but not during
> recovery. That is pretty strange. I don't see that it can actually
> do any worse than emit a few log messages at the start of recovery
> that won't show up again until the end of recovery, though.

Granted. Would it be better to update both functions at the same time, and
perhaps to make that a master-only change? Does the status quo cause more
practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
> >> until we're consistent. Also, instead of passing
> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
> >> the earliest offset that exists on disk. That way, it won't try to
> >> read data that's not there.
> >
> > Perhaps emit a LOG message when we do that, since it's our last opportunity to
> > point to the potential data loss?
>
> If the problem is just that somebody minmxid got set to 1 instead of
> the real value, I think that there is no data loss, because none of
> those older values are actually present there. But we could add a LOG
> message anyway. How do you suggest that we phrase that?

There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
to 1. Otherwise, data loss is possible. I don't hope for an actionable
message, but we might want a reporter to grep logs for it when we diagnose
future reports. Perhaps this:

"missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
more than one segment:

"excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

> >> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
> >> SimpleLruTruncate(MultiXactOffsetCtl,
> >> MultiXactIdToOffsetPage(oldestMXact));
> >>
> >> + /* Update oldest-on-disk value in shared memory. */
> >> + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
> >> + if (earliest < FirstMultiXactId)
> >> + earliest = FirstMultiXactId;
> >> + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> >> + Assert(MultiXactState->oldestMultiXactOnDiskValid);
> >> + MultiXactState->oldestMultiXactOnDiskValid = earliest;
> >
> > That last line needs s/Valid//, I presume. Is it okay that
> > oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> > Valid indicator remaining true?
>
> Ay yai yay. Yes, s/Valid//.
>
> I'm not sure what you mean about it becoming too old. At least with
> that fix, it should get updated to exactly the first file that we
> didn't remove. Isn't that right?

Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
only in that it never reads or writes the cache. I might expect
oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
does hold most of the time. It does not always hold between the start of the
quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
That's because raw_GOMXOD() changes at the instant we unlink the oldest
segment, but oldestMultiXactOnDisk changes later. Any backend may call
GetOldestMultiXactOnDisk() via SetMultiXactIdLimit(). If a backend does that
concurrent with the checkpointer running TruncateMultiXact() and sees a stale
oldestMultiXactOnDisk, is that harmless?

> >> +static MultiXactOffset
> >> +GetOldestMultiXactOnDisk(void)
> >> +{
> >
> >> + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> >> + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> >> + if (earliest < FirstMultiXactId)
> >> + earliest = FirstMultiXactId;
> >
> > SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> > represent a wrapped state. When the files do represent a wrapped state,
> > MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> > result is sensitive to readdir() order. This code exists specifically to help
> > us make do in the face of wrong catalog and pg_control entries. We may have
> > wrapped as a result of those of same catalog/pg_control entries, so I think
> > this function ought to account for wrap. I haven't given enough thought to
> > what exactly it should do.
>
> I can see that there might be an issue there, but I can't quite put my
> finger on it well enough to say that it definitely is an issue. This
> code is examining the offsets space rather than the members space, and
> the protections against offsets wraparound have been there since the
> original commit of this feature
> (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have
> no concrete evidence that there's ever been a problem in this area.
> It seems like it might be safer to rejigger that code so that it
> considers distance-behind-current rather than using the wrapped
> comparison logic, but I'm reluctant to start rejiggering more things
> without knowing what specifically I'm fixing.

Anything that could cause the pg_multixact/offsets tail to rotate from being
in the past to being in the future poses this risk. (This is the tail from
the perspective of files on disk; pg_control, datminmxid, and MultiXactState
notions of the tail play no part here.) I had in mind that the pg_upgrade
datminmxid=1 bug could be a tool for achieving that, but I've been
unsuccessful so far at building a credible thought experiment around it. Near
the beginning of your reply, you surmised that this could happen between a
VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
Another vector is unlink() failure on a segment file. SlruDeleteSegment()
currently ignores the unlink() return value; the only harm has been some disk
usage. With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
mandatory to forestall the appearance of a wrapped state.

Thanks,
nm

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Paquier 2015-06-02 07:19:40 Re: [GENERAL] psql weird behaviour with charset encodings
Previous Message Zenaan Harkness 2015-06-02 05:16:53 Re: advocating LTS release and feature-train release cycles

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2015-06-02 05:44:29 Re: [CORE] postpone next week's release
Previous Message Michael Nolan 2015-06-02 04:35:23 Re: pg_xlog -> pg_xjournal?