Re: ERROR: multixact X from before cutoff Y found to be still running

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, "Nasby, Jim" <nasbyj(at)amazon(dot)com>
Subject: Re: ERROR: multixact X from before cutoff Y found to be still running
Date: 2019-09-05 18:32:15
Message-ID: 32134d69-1999-6518-e1d6-e268a24cb8aa@amazon.com
Lists: pgsql-bugs pgsql-hackers

On 9/4/19 21:01, Thomas Munro wrote:
> I suppose this requires you to run out of member space (for example
> many backends key sharing the same FK) or maybe just set
> autovacuum_multixact_freeze_max_age quite low, and then prolong the
> life of a multixact for longer.
On this particular production system,
autovacuum_multixact_freeze_max_age is at the default value of 400
million, and it is not overridden for any tables.  It looks to me like
this was just workload-driven.  There are a number of FKs, and those
seem like a likely culprit to me.
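
For anyone checking the same thing on their own system, the global
setting and any per-table overrides can be confirmed with something
like this (a sketch, using the standard catalogs):

  -- global setting
  SHOW autovacuum_multixact_freeze_max_age;

  -- any per-table overrides stored in reloptions
  SELECT relname, reloptions
    FROM pg_class
   WHERE reloptions::text LIKE '%multixact_freeze%';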

> Does the problem fix itself once you
> close the transaction that's in the oldest multixact, ie holding back
> GetOldestMultiXact() from advancing?
The really interesting thing about this case is that the only
long-running connection was the autovacuum that had been running since
Sunday. While we were investigating yesterday, the autovacuum process
died without advancing relfrozenxid (the users had configured this
system with minimal logging, so it's not known whether autovacuum
terminated on an error or was killed by someone who logged on to the
system).  As soon as the autovacuum
process died, we stopped getting the "multixact X from before cutoff Y"
errors.
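
In case it helps anyone chasing something similar: long-running
backends (including autovacuum workers, on versions where they appear
in pg_stat_activity) and per-table multixact age can be checked with
something like this (a sketch; mxid_age() exists in 9.5 and later):

  -- tables closest to autovacuum_multixact_freeze_max_age
  SELECT relname, mxid_age(relminmxid) AS multixact_age
    FROM pg_class
   WHERE relkind = 'r' AND relminmxid <> '0'
   ORDER BY mxid_age(relminmxid) DESC
   LIMIT 10;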

It really appears that it was the autovacuum process itself that was
holding the oldest running multixact, which caused errors on
yesterday's attempts to vacuum other tables - even though I thought
vacuum processes were ignored by that code.  I'll have to take another
look at some point.

Vacuum cost parameters had been adjusted after Sunday, so the original
autovacuum would have been running with the default settings.
Naturally, a new autovacuum process started up right away.  This new
process - definitely using the adjusted cost parameters - completed the
vacuum of the large table in a couple of hours, with 5 index passes
(index_vacuum_count).  maintenance_work_mem was already at its maximum;
there were many hundreds of millions of dead tuples still remaining to
be cleaned up.
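
The index pass count is visible while a vacuum is running via
pg_stat_progress_vacuum (9.6 and later); roughly:

  -- progress of running vacuums, including number of index passes
  SELECT p.pid, c.relname, p.phase,
         p.index_vacuum_count,
         p.num_dead_tuples, p.max_dead_tuples
    FROM pg_stat_progress_vacuum p
    JOIN pg_class c ON c.oid = p.relid;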

The size of the large table (heap only) was about 75% of the memory on
the server, and the table had three indexes each about half the size of
the table.  The storage was provisioned at just over 10k IOPS; at this
rate you could read all three indexes from the storage one block at a
time in about an hour.  (And Linux should be reading more than a block
at a time.)
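
(Back-of-the-envelope: 10,000 IOPS for an hour is 10,000 x 3,600 = 36
million random reads, and at the default 8 kB block size that's roughly
275 GiB - which is what the one-hour figure above implies for the
combined size of the three indexes.)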

It is not known whether the original autovacuum failed to completely
vacuum the large table in 3 days because of the cost settings alone, or
because another latent bug somewhere in the autovacuum code put it into
some kind of loop (though if autovacuum had hit the error above, the
PID would have terminated).  We didn't manage to get a pstack.

> Since VACUUM errors out, we
> don't corrupt data, right? Everyone else is still going to see the
> multixact as running and do the right thing because vacuum never
> manages to (bogusly) freeze the tuple.
That's my take as well.  I don't think there's any data corruption risk
here.

If anyone else ever hits this in the future, I think it's safe to just
kill the oldest open session. The error should go away and there
shouldn't be any risk of damage to the database.
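
Concretely, something along these lines would do it (a sketch -
confirm the pid before terminating anything):

  -- find the backend with the oldest open transaction
  SELECT pid, xact_start, state, left(query, 60) AS query
    FROM pg_stat_activity
   WHERE xact_start IS NOT NULL
   ORDER BY xact_start
   LIMIT 1;

  -- then, once you're sure it's the right one:
  SELECT pg_terminate_backend(12345);  -- substitute the pid found above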

> Both patches prevent mxactLimit from being newer than the oldest
> running multixact. The v1 patch uses the most aggressive setting
> possible: the oldest running multi; the v2 uses the least aggressive
> of the 'safe' and oldest running multi. At first glance it seems like
> the second one is better: it only does something different if we're in
> the dangerous scenario you identified, but otherwise it sticks to the
> safe limit, which generates less IO.
Thanks for taking a look!

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
