Re: MultiXactId Error in Autovacuum

From: Karl Hafner <karl(at)scoreloop(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: Strahinja Kustudic <strahinjak(at)nordeus(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: MultiXactId Error in Autovacuum
Date: 2013-12-03 08:55:54
Message-ID: CA+MJ4cTMTDKZa5P=rGbPiK0O0tv2xxpH7_Zopmi1Av0z61UUYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi all,

sorry for not getting back earlier. I wanted to let you know that we found
the reason for our error (and a possible solution for our problem) in the
meantime:

The starting point of our worries was a log file error message: "ERROR:
MultiXactId 2683601542 does no longer exist -- apparent wraparound".
Autovacuum never finished on one of our tables because of this.

So we started to investigate and found out:

We had a few rows in our table that suffered from some inconsistencies in
its rowlock status, probably caused by a disc crash a few month ago.
Statements like "SELECT * FROM mytable" or autovacuum run into an error and
exit when reading one of these inconstistent rows.

We digged deeper: Those rows were marked already DEAD but still pointed to
MultiXactIds that were not listed in
pg_multixact anymore (what probably messed up the lock information)

The interesting point is: Different statements read those information in a
different order.

(a) Statements that select a single row like "SELECT * FROM mytable WHERE
id = 'xxxx'" would recognize the status DEAD
BEFORE checking the LOCK status. Those statements just return 0 rows.

(b) Statements that read every row like "SELECT id FROM mytable" process
the LOCK information BEFORE checking if a row is already marked DEAD.
If there is something wrong (like in our case due to inconsistent
lock information) the statement quits with an ERROR.
In our case autovacuum always died because of this.

The trick was patching postgres for case (b) in a way that it would just
print a WARNING and the row ID for these defect rows and continue.

Finally we knew the IDs of the defect rows (we had to install the contrib
module pgrowlocks to find the locked rows) using the query
"SELECT id FROM mytable AS m, pgrowlocks('mytable') AS p WHERE p.locked_row
= m.ctid".
Just deleting these rows using "DELETE FROM mytable WHERE id = 'xxxx'" did
not work because of case (a). It returned 0 rows.
So we copied all but the defect rows into a new table, dropped the old
table, and renamed the new one - quite an effort for finally deleting two
rows :-)

We are wondering it there is a deeper reason behind treating case (a) and
(b) in a different way or if it would be better to get rid of these
inconsistencies in future postgres versions.

Thanks for all your input and help.
BR,
Karl

On Wed, Nov 20, 2013 at 3:20 PM, Karl Hafner <karl(at)scoreloop(dot)com> wrote:

>
> Well, probably you are right, because PostgreSQL would shutdown
>> before a wraparound happens, and you would have noticed that.
>
>
>
>> I don't know how it happened, but your database has suffered
>> corruption. Any chance you can restore from a backup?
>> Can you still "SELECT *" from the table that has the problem?
>>
>
> I cannot run a full SELECT * on that table. It stops after a few minutes
> with the same error.
> I am currently testing different things on a DB created from a backup like
> exporting per month/per day …
> but I am running into this error again and again.
> Running a VACUUM --FULL gives me: WARNING: concurrent delete in
> progress within table "the_bad_table" … then it also breaks.
>
>
> BR,
> Karl
>
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Simon Riggs 2013-12-03 09:12:43 Re: MultiXactId Error in Autovacuum
Previous Message Ray Stell 2013-11-27 21:05:47 Re: Postgres 9.2 pg_xlog cleanup