Re: multixacts woes

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multixacts woes
Date: 2015-05-10 17:40:12
Message-ID: 20150510174012.GA3618689@tornado.leadboat.com
Lists: pgsql-hackers

On Fri, May 08, 2015 at 02:15:44PM -0400, Robert Haas wrote:
> My colleague Thomas Munro and I have been working with Alvaro, and
> also with Kevin and Amit, to fix bug #12990, a multixact-related data
> corruption bug.

Thanks Alvaro, Amit, Kevin, Robert and Thomas for mobilizing to get this fixed.

> 1. I believe that there is still a narrow race condition that can cause
> the multixact code to go crazy and delete all of its data when
> operating very near the threshold for member space exhaustion. See
> http://www.postgresql.org/message-id/CA+TgmoZiHwybETx8NZzPtoSjprg2Kcr-NaWGajkzcLcbVJ1pKQ@mail.gmail.com
> for the scenario and proposed fix.

For anyone else following along, Thomas's subsequent test verified this threat
beyond reasonable doubt:

http://www.postgresql.org/message-id/CAEepm=3C32VPJLOo45y0c3-3KWXNV2xM4jaPTSVjCRD2VG0Qgg@mail.gmail.com

> 2. We have some logic that causes autovacuum to run in spite of
> autovacuum=off when wraparound threatens. My commit
> 53bb309d2d5a9432d2602c93ed18e58bd2924e15 provided most of the
> anti-wraparound protections for multixact members that exist for
> multixact IDs and for regular XIDs, but this remains an outstanding
> issue. I believe I know how to fix this, and will work up an
> appropriate patch based on some of Thomas's earlier work.

That would be good to have, and its implementation should be self-contained.

> 3. It seems to me that there is a danger that some users could see
> extremely frequent anti-mxid-member-wraparound vacuums as a result of
> this work. Granted, that beats data corruption or errors, but it
> could still be pretty bad. The default value of
> autovacuum_multixact_freeze_max_age is 400000000.
> Anti-mxid-member-wraparound vacuums kick in when you exceed 25% of the
> addressable space, or 1073741824 total members. So, if your typical
> multixact has more than 1073741824/400000000 = ~2.68 members, you're
> going to see more autovacuum activity as a result of this change.
> We're effectively capping autovacuum_multixact_freeze_max_age at
> 1073741824/(average size of your multixacts). If your multixacts are
> just a couple of members (like 3 or 4) this is probably not such a big
> deal. If your multixacts typically run to 50 or so members, your
> effective freeze age is going to drop from 400m to ~21.4m. At that
> point, I think it's possible that relminmxid advancement might start
> to force full-table scans more often than would be required for
> relfrozenxid advancement. If so, that may be a problem for some
> users.
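
As a rough check of the arithmetic quoted above, here is a minimal Python
sketch. The two constants are the figures given in the quoted text; the
helper effective_freeze_max_age() is a hypothetical name used only for
illustration and is not a PostgreSQL function. It assumes every multixact
has the same number of members, which real workloads will not.

```python
# Figures taken from the quoted message.
MEMBER_VACUUM_THRESHOLD = 1073741824   # 25% of the 2**32 member address space
DEFAULT_FREEZE_MAX_AGE = 400_000_000   # default autovacuum_multixact_freeze_max_age


def effective_freeze_max_age(avg_members, configured=DEFAULT_FREEZE_MAX_AGE):
    """Hypothetical helper: number of multixacts that can accumulate before
    either the configured age or the 25% member threshold forces a vacuum,
    assuming a uniform multixact size of avg_members."""
    member_cap = int(MEMBER_VACUUM_THRESHOLD / avg_members)
    return min(configured, member_cap)


# Average size above which the member threshold, not the GUC, is the limit.
break_even = MEMBER_VACUUM_THRESHOLD / DEFAULT_FREEZE_MAX_AGE
print(f"break-even members per multixact: {break_even:.2f}")

for members in (2, 3, 4, 50):
    print(f"{members:>3} members -> effective freeze max age "
          f"{effective_freeze_max_age(members):,}")
```

With two members per multixact the configured 400 million still governs;
at 50 members the member threshold caps the effective age at roughly 21
million, in line with the estimate in the quoted text.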

I don't know whether this deserves prompt remediation, but if it does, I would
look no further than the hard-coded 25% figure. We permit users to operate
close to XID wraparound design limits. GUC maximums force an anti-wraparound
vacuum no later than 93.1% of design capacity. XID assignment warns at
99.5%, then stops at 99.95%. PostgreSQL mandates a larger cushion for
pg_multixact/offsets, with anti-wraparound VACUUM by 46.6% and a stop at
50.0%. Commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 introduced the
bulkiest mandatory cushion yet, an anti-wraparound vacuum when
pg_multixact/members is just 25% full. The pgsql-bugs thread driving that
patch did reject making it GUC-controlled, essentially on the expectation that
25% should be adequate for everyone:

http://www.postgresql.org/message-id/CA+Tgmoap6-o_5ESu5X2mBRVht_F+KNoY+oO12OvV_WekSA=ezQ@mail.gmail.com
http://www.postgresql.org/message-id/20150506143418.GT2523@alvh.no-ip.org
http://www.postgresql.org/message-id/1570859840.1241196.1430928954257.JavaMail.yahoo@mail.yahoo.com
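
For readers who want to see where the cushion percentages above could come
from, here is a minimal Python sketch. Every constant in it is an
assumption chosen to reproduce the stated figures, not something spelled
out in this message: a GUC maximum of 2 billion for the freeze_max_age
settings, an XID stop limit 1 million before the wrap point, a warning 10
million before that, a design capacity of 2^31 for XIDs, and 2^32 for the
multixact offset and member spaces.

```python
# All constants below are assumptions made to reproduce the percentages
# in the message; consult the PostgreSQL source for authoritative values.
XID_CAPACITY = 2**31                 # assumed usable XID distance before wrap
MXID_CAPACITY = 2**32                # assumed multixact offset/member space
GUC_MAX_FREEZE_AGE = 2_000_000_000   # assumed upper bound of freeze_max_age GUCs
XID_STOP_MARGIN = 1_000_000          # assumed: stop this far before the wrap point
XID_WARN_MARGIN = XID_STOP_MARGIN + 10_000_000  # assumed: warn 10M before stop


def percent(used, capacity):
    """Share of capacity consumed, as a percentage."""
    return 100.0 * used / capacity


cushions = {
    "xid forced vacuum": percent(GUC_MAX_FREEZE_AGE, XID_CAPACITY),
    "xid warning": percent(XID_CAPACITY - XID_WARN_MARGIN, XID_CAPACITY),
    "xid stop": percent(XID_CAPACITY - XID_STOP_MARGIN, XID_CAPACITY),
    "offsets forced vacuum": percent(GUC_MAX_FREEZE_AGE, MXID_CAPACITY),
    "offsets stop": percent(MXID_CAPACITY // 2, MXID_CAPACITY),
    "members forced vacuum": percent(MXID_CAPACITY // 4, MXID_CAPACITY),
}

for name, value in cushions.items():
    print(f"{name:<22} {value:6.2f}%")
```

Under these assumptions, the members threshold is the only one that
triggers before even a third of the space is used.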

> What can we do about this? Alvaro proposed back-porting his fix for
> bug #8470, which avoids locking a row if a parent subtransaction
> already has the same lock.

Like Andres and yourself, I would not back-patch it.

> Another thought that occurs to me is that if we had a freeze map, it
> would radically decrease the severity of this problem, because
> freezing would become vastly cheaper. I wonder if we ought to try to
> get that into 9.5, even if it means holding up 9.5.

Declaring that a release will wait for a particular feature has consistently
ended badly for PostgreSQL, and this feature is just in the planning stages.
If folks are ready to hit the ground running on the project, I suggest they do
so; a non-WIP submission to the first 9.6 CF would be a big accomplishment.
The time to contemplate slipping it into 9.5 comes after the patch is done.

If these aggressive ideas earn more than passing consideration, the 25%
threshold should become user-controllable after all.
