Re: ERROR: found multixact from before relminmxid

From: Adrien NAYRAT <adrien(dot)nayrat(at)anayrat(dot)info>
To: Jeremy Finzel <finzelj(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Alexandre Arruda <adaldeia(at)gmail(dot)com>, <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: ERROR: found multixact from before relminmxid
Date: 2018-11-05 16:41:06
Message-ID: e9dbb964-2da7-1cc6-3ec5-b3684de84066@anayrat.info
Lists: pgsql-general

On 6/8/18 8:30 PM, Jeremy Finzel wrote:
>
>  No I was referring to this from the documentation:
>
> Avoid spuriously marking pages as all-visible (Dan Wood, Pavan
> Deolasee, Álvaro Herrera)
>
> This could happen if some tuples were locked (but not deleted).
> While queries would still function correctly, vacuum would normally
> ignore such pages, with the long-term effect that the tuples were
> never frozen. In recent releases this would eventually result in
> errors such as "found multixact nnnnn from before relminmxid nnnnn".
>
>

Hello hackers,

We hit the same issue on 9.6.10: autovacuum reports the same error when it
tries to freeze a table:
ERROR: found multixact 37009963 from before relminmxid 48606990

Because autovacuum was not able to freeze this table, Postgres eventually
reached the 1-million-transaction margin before wraparound and refused any
new transactions.
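
As a rough way to watch the remaining margin, a query along these lines can
be used (just a sketch, the column aliases are mine):

-- Per-database age of the oldest unfrozen XID / multixact; the server
-- refuses new transactions when roughly 1 million XIDs remain.
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS mxid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;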

We have an OLTP workload and I noticed queries like SELECT .. FOR SHARE.
I checked the durability settings (fsync, full_page_writes) and everything
seems fine. I also did not notice any errors from the storage (local SSD
behind a RAID controller).
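
For reference, a quick way to confirm those settings from SQL (nothing here
is specific to this incident):

SELECT name, setting, source
FROM pg_settings
WHERE name IN ('fsync', 'full_page_writes', 'synchronous_commit');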

We "solved" with a dump/restore on another server, also we kept previous
cluster to investigate. I want to be sure we encountered the bug solved
in 9.6.9 and it is not a new one.

If we confirm it, maybe we should advise users to perform an integrity
check? I was surprised that 9.6.9 prevents new corruption from appearing,
but offers nothing to verify whether a database is already corrupted.

FYI, we tried to run pg_dump on the secondary:
pg_dump: Error message from server: ERROR: could not access status of
transaction 1323321209
DETAIL: Could not open file "pg_clog/04EE": No such file or directory.

I am surprised this clog file is missing on the secondary (but present
on the primary):
[...]
04EB
04EC
04ED
0CEE
0CEF
0CF0
0CF1
[...]

I ran pg_visibility's checks; they do not reveal any corruption:
select pg_check_visible(i) from (select nspname ||'.' || relname as i
from pg_class join pg_namespace on
pg_class.relnamespace=pg_namespace.oid WHERE relkind IN ('r','t','m')) as i;
(0 rows)

select pg_check_frozen(i) from (select nspname ||'.' || relname as i
from pg_class join pg_namespace on
pg_class.relnamespace=pg_namespace.oid WHERE relkind IN ('r','t','m')) as i;
(0 rows)
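
Note for anyone reproducing this: pg_visibility ships in contrib and has to
be installed in the database first, e.g.:

CREATE EXTENSION IF NOT EXISTS pg_visibility;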

Here is the pg_controldata output and information on the table where
autovacuum reports the error:
/usr/pgsql-9.6/bin/pg_controldata
pg_control version number: 960
Catalog version number: 201608131
Database system identifier: 6451990604639161176
Database cluster state: in production
pg_control last modified: Mon 05 Nov 2018 12:44:30 PM CET
Latest checkpoint location: B9AF/70A4FD98
Prior checkpoint location: B9AF/70A4B7D0
Latest checkpoint's REDO location: B9AF/70A4FD60
Latest checkpoint's REDO WAL file: 000000010000B9AF00000070
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1323325854
Latest checkpoint's NextOID: 1665327960
Latest checkpoint's NextMultiXactId: 60149587
Latest checkpoint's NextMultiOffset: 174269996
Latest checkpoint's oldestXID: 3471809503
Latest checkpoint's oldestXID's DB: 16393
Latest checkpoint's oldestActiveXID: 1323325854
Latest checkpoint's oldestMultiXid: 48606990
Latest checkpoint's oldestMulti's DB: 16393
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Mon 05 Nov 2018 12:44:29 PM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 750
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

SELECT * FROM pg_class WHERE oid = 'anonymised'::regclass;
-[ RECORD 1 ]-------+---------------------------------------------------------
relname | anonymised
relnamespace | 2200
reltype | 16719
reloftype | 0
relowner | 16386
relam | 0
relfilenode | 438244741
reltablespace | 0
relpages | 1823173
reltuples | 6.82778e+07
relallvisible | 1822975
reltoastrelid | 438244744
relhasindex | t
relisshared | f
relpersistence | p
relkind | r
relnatts | 25
relchecks | 0
relhasoids | f
relhaspkey | t
relhasrules | f
relhastriggers | t
relhassubclass | f
relrowsecurity | f
relforcerowsecurity | f
relispopulated | t
relreplident | d
relfrozenxid | 3471809503
relminmxid | 48606990
relacl | {anonymised}
reloptions | {autovacuum_vacuum_scale_factor=0.1,autovacuum_analyze_scale_factor=0.05}
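
For context, a rough query along these lines lists the tables carrying the
oldest relfrozenxid / relminmxid (the aliases and LIMIT are only
illustrative):

SELECT c.oid::regclass        AS rel,
       age(c.relfrozenxid)    AS xid_age,
       mxid_age(c.relminmxid) AS mxid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY mxid_age(c.relminmxid) DESC
LIMIT 20;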

Do you have any ideas on how to investigate? Or steps to identify
pre-existing corruption on another instance?

Thanks!
