9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"
Date: 2014-07-15 22:58:35
Message-ID: CAMkU=1wX9eUumStJODnigW6kB==aNJv5jCUwybzRMNi=Qajs1w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:

> Jeff Janes wrote:
>
> > This problem was initially fairly easy to reproduce, but since I
> > started adding instrumentation specifically to catch it, it has become
> > devilishly hard to reproduce.
> >
> > I think my next step will be to also log each of the values which goes
> > into the complex if (...) expression that decides on the deletion.
>
> Could you please try to reproduce it after updating to latest? I pushed
> fixes that should close these issues. Maybe you want to remove the
> instrumentation you added, to make failures more likely.
>

There are still some problems in 9.4, but I haven't been able to diagnose
them and wanted to do more research on them. The announcement of upcoming
back-branch releases for 9.3 spurred me to try it there, and I have problems
with 9.3 (12c5bbdcbaa292b2a4b09d298786) as well. The move of truncation to the
checkpoint seems to have made the problem easier to reproduce. On an 8
core machine, this test fell over after about 20 minutes, which is much
faster than it usually reproduces.

This is the error I get:

2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR: could not access status of transaction 85837221
2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL: Could not open file "pg_multixact/members/14031": No such file or directory.
2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

The testing harness is attached as three patches that must be applied to the
test server, and two scripts. The script do.sh sets up the database (using
fixed paths, so be careful) and then invokes count.pl in a loop to do the
actual work.

Cheers,

Jeff

Attachment                          Content-Type              Size
0002-pg_burn_multixact-utility.patch  application/octet-stream  7.0 KB
count.pl                            application/octet-stream  9.7 KB
crash_REL9_4_BETA1.patch            application/octet-stream  12.6 KB
do.sh                               application/x-sh          3.6 KB
member_delete_log.patch             application/octet-stream  999 bytes
