Re: Error on vacuum: xmin before relfrozenxid

From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 22:04:26
Message-ID: CAKMFJucLdKK+Js1+o+z-aJ8qRuWMSivmpzPK_8zQNLLpysRsAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:

> Hi,
>
> On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > > Could you report the result of
> > > select ctid, xmin, xmax from pg_authid ;
> > >
> >
> > This is the result:
> >
> > postgres=# select ctid, xmin, xmax from pg_authid ;
>
> > (0,16) | 3031994631 | 0
> > 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 |
> (0,16)
> > | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> > \x6e6167696f7300000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 0000000000001000001000000ffffffff496d64356333633236616163636
> 439616665346437383061396239613464663634653639
>
> > postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> > relfrozenxid
> > --------------
> > 400011439
>
> That tuple looks, to me, like it indeed shouldn't exist, given the
> relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right,
> shows:
>
> HEAP_HASNULL
> HEAP_HASVARWIDTH
> HEAP_HASOID
> HEAP_XMIN_COMMITTED
> HEAP_XMAX_INVALID
> HEAP_UPDATED
>
> so it's not frozen. That suggests there's either a bug, or you had
> corruption in your cluster.
>
>
> Could you give a bit of "history" about that postgres instance? What
> version of postgres was it run on earliest? Was it ever pg_upgraded?
> Were there any OS crashes? Other hardware issues? Was the cluster ever
> used with fsync=off or full_page_writes=off?
>
> Greetings,
>
> Andres Freund
>

The cluster is made of a primary master instance, and a secondary slave in
hot standby, with streaming replication. There is a barman server for the
backups.
The first version it ran on was 10.2, the cluster was promoted in
production in mid february. It was never pg_upgraded, we upgraded it
yesterday to 10.4 and restarted, in hope that this would resolve the issue.
The restart was performed by the upgrading process. There was never any
crash or hardware issue, the instance run without any problem. we never
enabled fsync=off or full_page_writes=off. This is the only real issue so
far. We have many nagios alerts monitoring the instance, the only problem
is we are creating many temporary files, most of them are caused by a query
that doesn't overflow work_mem when run alone with explain analyze buffers,
which is weird. I should mention that we had a major creation of temp files
in the first hours of the cluster history, however that was quickly
resolved.

I managed to recover the log of the first time we run into the issue, the
error was the same but on template1:

May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
found xmin 2600758304 from before relfrozenxid 400011439
May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you see, the problem with 400011439 was
there already. This happened when we added a user, shortly afterwards these
errors started popping up in the log. We found the issue because the
autovacuums stopped working, they were blocked on the pg_authid table. The
same time we found out the long running transaction, it was a multi session
kill that someway got stuck. Killing the transaction and deleting the
offending row was enough for the autovacuums to restart. We also had a
materialized view that was created months before and nobody used. Since i
found in the past there was a similar error related to materialized views,
I dropped it.

The instance is doing a lot of transactions every day, probably around 100
millions. I would need to do more checks to provide an accurate measure.

I still have the logs since the first time the error appeared, so I can
provide further details if needed.

Thanks,
Best Regards

Paolo Crosato

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Thomas Munro 2018-05-22 23:20:14 Re: pg_multixact/members growing
Previous Message Andres Freund 2018-05-22 21:19:37 Re: Error on vacuum: xmin before relfrozenxid

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-05-22 22:06:03 Re: COPY FROM WITH HEADER skips a tuple every 4 billion tuples
Previous Message Tom Lane 2018-05-22 22:02:27 Re: plperl fails with perl 5.28