Re: BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Serge Negodyuck <petr(at)petrovich(dot)kiev(dot)ua>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
Date: 2013-12-09 18:27:01
Message-ID: 20131209182701.GD9519@awork2.anarazel.de
Lists: pgsql-bugs pgsql-hackers

Hi,

On 2013-12-09 17:49:34 +0200, Serge Negodyuck wrote:
> On master there are files from 0000 to 14078
>
> On slave there were absent files from A1xx to FFFF
> They were the oldest ones. (October, November)

Some analysis later, I am pretty sure that the origin is a longstanding
problem and not connected to 9.3.[01] vs 9.3.2.

The 14078 file referenced above is exactly the last segment before a
members wraparound:
(gdb) p/x (1L<<32)/(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)
$10 = 0x14078
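
For readers without a debug build at hand, the same arithmetic can be reproduced as a small sketch. The constants below are assumptions not stated in this mail: the default 8 kB block size, 32 pages per SLRU segment, and the 9.3 members page layout (groups of 4 flag bytes plus 4 four-byte xids).

```python
# Sketch: reproduce the gdb computation above.
# Assumed constants (default build, 9.3 members layout), not taken from this mail.
BLCKSZ = 8192
SLRU_PAGES_PER_SEGMENT = 32

FLAG_BYTES_PER_GROUP = 4      # one flag byte per member in a group
MEMBERS_PER_GROUP = 4         # four xids per group
XID_SIZE = 4                  # bytes per xid

GROUP_SIZE = FLAG_BYTES_PER_GROUP + MEMBERS_PER_GROUP * XID_SIZE
GROUPS_PER_PAGE = BLCKSZ // GROUP_SIZE
MULTIXACT_MEMBERS_PER_PAGE = GROUPS_PER_PAGE * MEMBERS_PER_GROUP

# Number of member entries covered by one segment file.
members_per_segment = MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT

# Segment that holds the very end of the 32-bit member offset space.
last_segment = (1 << 32) // members_per_segment
print(hex(last_segment))  # 0x14078
```

Under those assumptions the result matches the highest file name seen on the master.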

So, what happened is that enough multixacts were created that the
members slru wrapped around. It's not unreasonable for the members slru
to wrap around faster than the offsets one - after all we create at
least two entries in members for every offset entry. Also, in 9.3+ an
offsets page holds more entries than a members page.
When truncating, we first read the offset, to know where we currently
are in members, and then truncate both at their respective
points. Since we've wrapped around in members we very well might remove
content we actually need.
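
The hazard can be illustrated with a toy model. This is not the actual truncation code, only a sketch of why a position-based cutoff on a circular 32-bit counter becomes unsafe once nothing bounds how much of the counter space is live. The function names and the cutoff value are hypothetical.

```python
MASK = 0xFFFFFFFF  # toy model: member offsets as 32-bit counters

def precedes(a, b):
    """Wraparound-aware 'a is older than b', i.e. (int32)(a - b) < 0."""
    return ((a - b) & MASK) >= 0x80000000

def wrongly_removed(cutoff, live_span, samples=1000):
    """Count sampled live offsets that look older than the cutoff.

    cutoff    -- oldest member offset still needed (hypothetical value)
    live_span -- how many offsets past the cutoff are still in use
    """
    step = max(live_span // samples, 1)
    live = ((cutoff + i) & MASK for i in range(0, live_span, step))
    # Anything that "precedes" the cutoff would be a truncation candidate.
    return sum(1 for off in live if precedes(off, cutoff))

cutoff = 0xA1000000  # arbitrary starting point, for illustration only

# Live range well below half the counter space: nothing is misjudged.
print(wrongly_removed(cutoff, 1 << 30))

# Live range beyond half the counter space: the newest entries look "old".
print(wrongly_removed(cutoff, 3 * (1 << 30)))
```

In this model the first call reports no misjudged offsets, while the second reports a nonzero count: entries that are still needed compare as older than the cutoff and would be eligible for removal.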

I've recently remarked that I find it dangerous that we only do
anti-wraparound stuff for pg_multixact/offsets, not for /members. So,
here we have the proof that that's bad.
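
A back-of-the-envelope sketch of why protecting only offsets is insufficient, assuming both multixact ids and member offsets are 32-bit counters and taking the average number of members per multixact as a free parameter (the sample values are made up):

```python
MEMBER_SPACE = 1 << 32  # assumed size of the member offset space
MULTI_SPACE = 1 << 32   # assumed size of the multixact id space

def multis_until_members_wrap(avg_members_per_multi):
    """Multixacts that can be created before member offsets wrap around."""
    return MEMBER_SPACE // avg_members_per_multi

# Every multixact has at least two members, so 2 is the best case.
for k in (2, 3, 5):
    n = multis_until_members_wrap(k)
    share = n / MULTI_SPACE
    print(f"{k} members/multi: wraps after {n} multis ({share:.0%} of id space)")
```

Even in the best case of two members per multixact, the members space is exhausted after half of the multixact id space has been used, so a limit that only watches multixact ids can be far from triggering when members wraps.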

This is an issue in <9.3 as well. It might, in some sense, even be worse
there, because we never vacuum old multis away. But on the other hand,
the growth of multis is slower there and we look into old multis less
frequently.

The only reason that you saw the issue on the standby first is that the
truncation code is called more frequently there. Afaics it will happen,
sometime in the future, on the master as well.

I think problems should be preventable if you issue a systemwide VACUUM
FREEZE, but please let others chime in before you execute it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
