Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"
Date: 2014-08-19 20:24:20
Message-ID: CAMkU=1wKFSCByKkhdbPPD49ENJQJ9NrXDkDHZyBdqiL1KGdWTA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>
>> Jeff Janes wrote:
>>
>> > This problem was initially fairly easy to reproduce, but since I
>> > started adding instrumentation specifically to catch it, it has become
>> > devilishly hard to reproduce.
>> >
>> > I think my next step will be to also log each of the values which goes
>> > into the complex if (...) expression that decides on the deletion.
>>
>> Could you please try to reproduce it after updating to latest? I pushed
>> fixes that should close these issues. Maybe you want to remove the
>> instrumentation you added, to make failures more likely.
>>
>
> There are still some problems in 9.4, but I haven't been able to diagnose
> them and wanted to do more research on it. The announcement of upcoming
> back-branches for 9.3 spurred me to try it there, and I have problems with
> 9.3 (12c5bbdcbaa292b2a4b09d298786) as well. The move of truncation to the
> checkpoint seems to have made the problem easier to reproduce. On an 8
> core machine, this test fell over after about 20 minutes, which is much
> faster than it usually reproduces.
>
> This is the error I get:
>
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR: could not access status of
> transaction 85837221
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL: Could not open file
> "pg_multixact/members/14031": No such file or directory.
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT: SQL statement "SELECT 1
> FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR
> KEY SHARE OF x"
>
> The testing harness is attached as 3 patches that must be made to the test
> server, and 2 scripts. The script do.sh sets up the database (using fixed
> paths, so be careful) and then invokes count.pl in a loop to do the
> actual work.
>

Sorry, after a long time when I couldn't do much testing on this, I've now
been able to get back to it.

It looks like what is happening is that checkPoint.nextMultiOffset wraps
around from 2^32 to 0, even if 0 is still being used. At that point it
starts deleting member files that are still needed.
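
To make the wraparound concrete, here is a toy model in Python. It is not PostgreSQL's code and does not claim to reproduce the actual truncation logic. It assumes the member offset is a 32-bit unsigned counter and that one member segment file covers 1636 * 32 offsets (members per 8 kB page times pages per segment in the 9.3 layout); those constants and every function name below are assumptions made for illustration. Under them, the file "14031" from the error above sits near the top of the offset space, shortly before the counter would wrap.

```python
# Toy model of a 32-bit multixact member offset counter; NOT PostgreSQL code.
# ASSUMPTION: 1636 members per 8 kB page and 32 pages per segment file.
MEMBERS_PER_PAGE = 1636
PAGES_PER_SEGMENT = 32
MEMBERS_PER_SEGMENT = MEMBERS_PER_PAGE * PAGES_PER_SEGMENT
OFFSET_SPACE = 2**32


def segment_of(offset):
    """Return the number of the member segment that holds this offset."""
    return (offset % OFFSET_SPACE) // MEMBERS_PER_SEGMENT


def segment_name(offset):
    """Return the segment number as upper-case hex, padded to 4 digits."""
    return "%04X" % segment_of(offset)


def advance(next_offset, nmembers):
    """Advance the counter by nmembers, wrapping silently at 2^32."""
    return (next_offset + nmembers) % OFFSET_SPACE


def segments_kept(oldest_offset, next_offset):
    """Segments a naive truncation would keep (hypothetical rule): the
    circular range from the segment of oldest_offset through the segment
    of next_offset, inclusive. Everything else is treated as removable."""
    last_segment = segment_of(OFFSET_SPACE - 1)
    seg = segment_of(oldest_offset)
    end = segment_of(next_offset)
    kept = {seg}
    while seg != end:
        seg = 0 if seg == last_segment else seg + 1
        kept.add(seg)
    return kept


if __name__ == "__main__":
    oldest = 100  # offset 0's segment is still in use
    near_top = 0x14031 * MEMBERS_PER_SEGMENT + 10
    print(segment_name(near_top), 0x14031 in segments_kept(oldest, near_top))

    # The counter wraps past 2^32 and lands back in segment 0.
    wrapped = advance(OFFSET_SPACE - 5, 205)
    print(wrapped, 0x14031 in segments_kept(oldest, wrapped))
```

In this model, once the counter re-enters a segment that is still in use, the range between oldest and next collapses and nearly every segment looks removable, including ones written just before the wrap. Whether the real code fails in exactly this way is a separate question; the sketch only shows why an unguarded wrap is dangerous.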

Is there some interlock which is supposed to prevent
checkPoint.nextMultiOffset from lapping itself? I haven't been able to find
it. It seems like the interlock applies only to MultiXid, not the Offset.
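
For illustration only, the kind of check being asked about could look like the sketch below. It is hypothetical: the function name `would_lap` and the rule of always leaving one offset unused are assumptions, not anything taken from the PostgreSQL source.

```python
# Hypothetical guard against the member offset counter lapping itself.
# This is a sketch of the idea, not code from PostgreSQL.
OFFSET_SPACE = 2**32


def would_lap(oldest_offset, next_offset, nmembers):
    """Return True if allocating nmembers starting at next_offset would
    reach or pass oldest_offset in the circular 32-bit offset space.

    One offset is always left unused so that next == oldest can only
    mean "empty", never "full".
    """
    used = (next_offset - oldest_offset) % OFFSET_SPACE
    free = OFFSET_SPACE - used
    return nmembers >= free


if __name__ == "__main__":
    # The counter has wrapped and sits 10 offsets behind the oldest one.
    print(would_lap(100, 90, 9))
    print(would_lap(100, 90, 10))
```

A caller would refuse the allocation (or force cleanup first) whenever `would_lap` returns True, which is the behaviour that appears to exist for MultiXid but not for the offset.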

Thanks,

Jeff
