Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
Date: 2018-05-23 13:52:13
Message-ID: CAMa1XUjOhoxrcdtTPyxBU_CUwjii2Bkc1Fdnto4K-oATDAv4CA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Tue, May 22, 2018 at 2:42 PM, Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com> wrote:

>
>
> On Tue, May 22, 2018 at 10:30 PM, Andres Freund <andres(at)anarazel(dot)de>
> wrote:
>
>> Hi,
>>
>> On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
>> > On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres(at)anarazel(dot)de>
>> wrote:
>> > > > select relfrozenxid from pg_class where relname='pg_authid';
>> > > > relfrozenxid
>> > > > --------------
>> > > > 2863429136
>>
>> > ​select txid_current();
>> > txid_current
>> > --------------
>> > 41995913769
>>
>> So that's an xid of 3341208114, if you leave the epoch out. What's
>> ​​
>> autovacuum_freeze_max_age set to in that cluster?
>>
>
> ​postgres=# show autovacuum_freeze_max_age;
> autovacuum_freeze_max_age
> ---------------------------
> 200000000
> (default value I think)​
>
>
>
>> Can you show pg_controldata output, and
>> ​​
>> relminmxid from that cluster?
>>
>
> ​postgres(at)db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata -D
> /var/lib/postgresql/9.6/main
> pg_control version number: 960
> Catalog version number: 201608131
> Database system identifier: 6469368654711450114
> Database cluster state: in production
> pg_control last modified: Tue 22 May 2018 10:20:14 PM MSK
> Latest checkpoint location: CCB5/F9C37950
> Prior checkpoint location: CCB0/43F316B0
> Latest checkpoint's REDO location: CCB1/6706BD88
> Latest checkpoint's REDO WAL file: 000000010000CCB100000067
> Latest checkpoint's TimeLineID: 1
> Latest checkpoint's PrevTimeLineID: 1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 9:3341161759
> Latest checkpoint's NextOID: 190071899
> Latest checkpoint's NextMultiXactId: 59416233
> Latest checkpoint's NextMultiOffset: 215588532
> Latest checkpoint's oldestXID: 2814548646
> Latest checkpoint's oldestXID's DB: 16400
> Latest checkpoint's oldestActiveXID: 3341161759
> Latest checkpoint's oldestMultiXid: 54264778
> Latest checkpoint's oldestMulti's DB: 16400
> Latest checkpoint's oldestCommitTsXid:2814548646
> Latest checkpoint's newestCommitTsXid:3341161758
> Time of latest checkpoint: Tue 22 May 2018 10:05:16 PM MSK
> Fake LSN counter for unlogged rels: 0/1
> Minimum recovery ending location: 0/0
> Min recovery ending loc's timeline: 0
> Backup start location: 0/0
> Backup end location: 0/0
> End-of-backup record required: no
> wal_level setting: replica
> wal_log_hints setting: on
> max_connections setting: 2000
> max_worker_processes setting: 8
> max_prepared_xacts setting: 0
> max_locks_per_xact setting: 64
> track_commit_timestamp setting: on
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Size of a large-object chunk: 2048
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value
> Data page checksum version: 0
> ​
> postgres=# select datname,datfrozenxid,datminmxid from pg_database order
> by datname;
> datname | datfrozenxid | datminmxid
> -----------+--------------+------------
> ** | 2815939794 | 54265194
> postgres | 2863429136 | 54280819
> template0 | 3148297669 | 59342106
> template1 | 2816765546 | 59261794
>
>
>
>
>
>>
>> I might be daft here, but it's surely curious that the relfrozenxid from
>> the error and pg_catalog are really different (number of digits):
>> catalog: 2863429136
>> error: 248712603
>>
>>
>> > ​About gdb bt - it's tricky because it is mission critical master db of
>> > huge project.
>> > I'll will try promote backup replica and check is issue persist there
>> and
>> > if yes - we will have our playground for a while, but it will require
>> > sometime to arrange.​
>>
>> You should be ok to just bt that in the running cluster, but I
>> definitely understand if you don't want to do that... I'd appreciate if
>> you set up the a playground, because this seems like something that'll
>> reappear.
>>
>
> ​My gdb kunf-fu exceptionally rusty now, so I better play in sandbox.​
> ​And if error did't exist on fresh promoted replica it will give us useful
> info as well.​
>
> --
> Maxim Boguk
> Senior Postgresql DBA
> http://dataegret.com/ <http://www.postgresql-consulting.com/>
>
> Phone RU: +7 985 433 0000
> Phone UA: +380 99 143 0000
> Phone AU: +61 45 218 5678
>
> LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
> Skype: maxim.boguk
>
> "Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
> когда я так делаю ещё раз?"
>
>
All -

We just noticed that this exact same problem has now returned for us. It
has exactly the same symptoms as my original post (including the fact that
a snapshot does not have the issue).

I am open to all suggestions as to troubleshooting this further!

Thank you,
Jeremy

In response to

Browse pgsql-general by date

  From Date Subject
Next Message a 2018-05-23 14:03:13 Re:How do I copy an element of composite type array into csv file?
Previous Message a 2018-05-23 13:50:28 Re:RE: RE: How do I select composite array element that satisfy specific conditions.

Browse pgsql-hackers by date

  From Date Subject
Next Message Sand Stone 2018-05-23 14:06:41 Re: dsa_allocate() faliure
Previous Message David Rowley 2018-05-23 12:25:24 Re: Subplan result caching