Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
Date: 2014-06-17 23:17:40
Message-ID: CAMkU=1wKqmmvCYWWDb1Pip8G1YZY7BOiSSNasqvUMEN5N_rpLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

(Redirected from Bugs to Hackers, as it seems not to be the same bug
as the original one.)

On Mon, Jun 9, 2014 at 3:50 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Jeff Janes wrote:
>
>> This patch seems to solve a problem I've also been having with non-existent
>> "pg_multixact/members/13D35" files in my testing of torn page write and fk
>> locks again recent 9.4. However, sometimes I've found the problem even
>> before multi wrap around occured, and my reading of this thread suggests
>> that the current issue would not show up in that case, so I don't know how
>> much comfort to take in the knowledge that the problem seems to have gone
>> away--as they say, things that go away by themselves without explanation
>> also tend to come back by themselves.
>
> I think this problem is rather related to the code touched by commit
> 722acf51a0d074d19782ad7e97ebe3fdb63fbb87, which is about removing
> unneeded files in pg_multixact/members. I checked your files, and it
> seems that pg_multixact/members files that should still exist are gone;
> for instance, pg_controldata says that the oldest multi is 2586314407,
> but if I run
> select * from pg_get_multixact_members('that value')
> I get an error saying that pg_multixact/members/218B does not exist.
>
> It seems that there are a lot of files that were improperly removed.
> I expect that existing files should be sequential, but there are a lot
> of holes in a listing of pg_multixact/members.

I think the holes are because a truncation was in progress and
interrupted by a 'pg_ctl stop -mi' before it completed. The files are
considered in an apparently random order, so if it is interrupted
midway through it will leave a collection of holes that would have
been closed up had it been allowed to continue.

I instrumented the code to log file deletion (attached patch) and I
see that the it starts complaining about the absence of 5183 just a
fraction of a second after it was deleted, so the deletion must have
been inappropriate.

12538 2014-06-17 12:10:02.925 PDT:LOG: JJ deleting 0xb66b20 5183
12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL: Could not open file
"pg_multixact/members/5183": No such file or directory.
12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL: Could not open file
"pg_multixact/members/5183": No such file or directory.
12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL: Could not open file
"pg_multixact/members/5183": No such file or directory.

This problem was initially fairly easy to reproduce, but since I
started adding instrumentation specifically to catch it, it has become
devilishly hard to reproduce.

I think my next step will be to also log each of the values which goes
into the complex if (...) expression that decides on the deletion.

Cheers,

Jeff

Attachment Content-Type Size
member_delete_log.patch application/octet-stream 999 bytes

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Pavan Deolasee 2014-06-18 07:01:59 Re: BUG #10675: alter database set tablespace and unlogged table
Previous Message smsiebe 2014-06-17 17:55:11 BUG #10680: LDAP bind password leaks to log on failed authentication

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-06-17 23:22:07 Re: comparison operators
Previous Message Andrew Dunstan 2014-06-17 23:12:39 comparison operators