Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-06-01 18:06:05
Message-ID: CA+TgmoYgKz=PQKNSk7sCkxcZWTw_p5P6a+rjQ1LjY=GA_W6Y-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> Incomplete review, done in a relative rush:

Thanks.

> On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> OK, here's a patch. Actually two patches, differing only in
>> whitespace, for 9.3 and for master (ha!). I now think that the root
>> of the problem here is that DetermineSafeOldestOffset() and
>> SetMultiXactIdLimit() were largely ignorant of the possibility that
>> they might be called at points in time when the cluster was
>> inconsistent.
>
> A cause perhaps closer to the root is commit f741300 moving truncation from
> VACUUM to checkpoints. CLOG has given us deep experience with VACUUM-time
> truncation. Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
> truncation up to the same level.
>
> Speaking of commit f6a6c46d, it seems logical that updating allocation stop
> limits should happen proximate to truncation. That's currently the case for
> CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
> TruncateMultiXact() call does both). However, pg_multixact/offsets is
> truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
> I did not distill an errant test case, but this is fishy.

Good point. Because we confine ourselves to using half the offset
space, it seems much harder for us to get into trouble here than it is
with members. The first scenario that occurred to me is that the SLRU
might actually wrap. That seems tough, though: between one checkpoint
and the next, vacuum would need to advance oldest_datminmxid by 2^31
MXIDs while generating 2^31 new ones, or something like that. That
doesn't seem real plausible. But then it occurred to me that it's
probably sufficient to advance the head of the SLRU far enough that
TruncateMultiXact things that the tail is in the future instead of in
the past. I see no reason why that couldn't happen. Then we'd end up
leaving some files behind that we should have removed. I'm not sure
exactly what problem that would cause; would they just get overwritten
on the next pass through the space, or would they cause errors? I
have not had time to check.

>> SetMultiXactIdLimit() bracketed certain parts of its
>> logic with if (!InRecovery), but those guards were ineffective because
>> it gets called before InRecovery is set in the first place.
>
> SetTransactionIdLimit() checks InRecovery for the same things, and it is
> called at nearly the same moments as SetMultiXactIdLimit(). Do you have a
> sense of whether it is subject to similar problems as a result?

Well, I think it's pretty weird that those things will get done before
beginning recovery, even on an inconsistent cluster, but not during
recovery. That is pretty strange. I don't see that it can actually
do any worse than emit a few log messages at the start of recovery
that won't show up again until the end of recovery, though.

>> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> until we're consistent. Also, instead of passing
>> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> the earliest offset that exists on disk. That way, it won't try to
>> read data that's not there.
>
> Perhaps emit a LOG message when we do that, since it's our last opportunity to
> point to the potential data loss?

If the problem is just that somebody minmxid got set to 1 instead of
the real value, I think that there is no data loss, because none of
those older values are actually present there. But we could add a LOG
message anyway. How do you suggest that we phrase that?

>> + * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
>> + * had bugs that could allow users who reached those release through
>
> s/release/releases/

Fixed.

>> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
>> SimpleLruTruncate(MultiXactOffsetCtl,
>> MultiXactIdToOffsetPage(oldestMXact));
>>
>> + /* Update oldest-on-disk value in shared memory. */
>> + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
>> + if (earliest < FirstMultiXactId)
>> + earliest = FirstMultiXactId;
>> + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>> + Assert(MultiXactState->oldestMultiXactOnDiskValid);
>> + MultiXactState->oldestMultiXactOnDiskValid = earliest;
>
> That last line needs s/Valid//, I presume. Is it okay that
> oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> Valid indicator remaining true?

Ay yai yay. Yes, s/Valid//.

I'm not sure what you mean about it becoming too old. At least with
that fix, it should get updated to exactly the first file that we
didn't remove. Isn't that right?

>> +static MultiXactOffset
>> +GetOldestMultiXactOnDisk(void)
>> +{
>
>> + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
>> + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
>> + if (earliest < FirstMultiXactId)
>> + earliest = FirstMultiXactId;
>
> SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> represent a wrapped state. When the files do represent a wrapped state,
> MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> result is sensitive to readdir() order. This code exists specifically to help
> us make do in the face of wrong catalog and pg_control entries. We may have
> wrapped as a result of those of same catalog/pg_control entries, so I think
> this function ought to account for wrap. I haven't given enough thought to
> what exactly it should do.

I can see that there might be an issue there, but I can't quite put my
finger on it well enough to say that it definitely is an issue. This
code is examining the offsets space rather than the members space, and
the protections against offsets wraparound have been there since the
original commit of this feature
(0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have
no concrete evidence that there's ever been a problem in this area.
It seems like it might be safer to rejigger that code so that it
considers distance-behind-current rather than using the wrapped
comparison logic, but I'm reluctant to start rejiggering more things
without knowing what specifically I'm fixing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Guillaume Lelarge 2015-06-01 18:13:42 Re: Is my standby fully connected?
Previous Message Guillaume Lelarge 2015-06-01 18:04:26 Re: Is my standby fully connected?

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2015-06-01 18:09:31 Re: [CORE] postpone next week's release
Previous Message Andrew Gierth 2015-06-01 18:03:55 Re: Join Filter vs. Index Cond (performance regression 9.1->9.2+/HEAD)