Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Steve Kehlet <steve(dot)kehlet(at)gmail(dot)com>, Forums postgresql <pgsql-general(at)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date: 2015-05-28 22:37:57
Message-ID: CAEepm=3NxNE7Emqkhr=QtyKak_LWR-TrhV-RqfAN1PO4sCn6WQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Fri, May 29, 2015 at 7:56 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> [ speculation ]
>
> [...] However, since
> the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
> which will call SetMultiXactIdLimit, which will propagate the bogus
> datminmxid = 1 setting into shared memory.

Ah!

> [...]
>
> - There's a third possible problem related to boundary cases in
> SlruScanDirCbRemoveMembers, but I don't understand that one well
> enough to explain it. Maybe Thomas can jump in here and explain the
> concern.

I noticed something in passing which is probably not harmful, and not
relevant to this bug report, it was just a bit confusing while
testing: SlruScanDirCbRemoveMembers never deletes any files if
rangeStart == rangeEnd. In practice, if you have an idle cluster with
a lot of multixact data and you VACUUM FREEZE all databases and then
CHECKPOINT, you might be surprised to see no member files going away
quite yet, but they'll eventually be truncated by a future checkpoint,
once rangeEnd has had a chance to advance to the next page due to more
multixacts being created.

If we want to fix this one day, maybe the right thing to do is to
treat the rangeStart == rangeEnd case the same way we treat rangeStart
< rangeEnd, that is, to assume that the range of pages isn't
wrapped/inverted in this case. Although we don't have the actual
start and end offset values to compare here, we know that for them to
fall on the same page, the start offset index must be <= the end
offset index (since we added the new error to prevent member space
wrapping, we never allow the end to get close enough to the start to
fall on the same page). Like this (not tested):

diff --git a/src/backend/access/transam/multixact.c
b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char
*filename, int segpage,
/* Recheck the deletion condition. If it still holds, perform deletion */
if ((range->rangeStart > range->rangeEnd &&
segpage > range->rangeEnd && segpage < range->rangeStart) ||
- (range->rangeStart < range->rangeEnd &&
+ (range->rangeStart <= range->rangeEnd &&
(segpage < range->rangeStart || segpage > range->rangeEnd)))
SlruDeleteSegment(ctl, filename);

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Robert Haas 2015-05-28 23:24:26 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Previous Message Alvaro Herrera 2015-05-28 22:21:39 Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-05-28 23:24:26 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Previous Message Alvaro Herrera 2015-05-28 22:21:39 Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1