"PANIC: could not open critical system index 2662" - twice

From: Evgeny Morozov <postgresql3(at)realityexists(dot)net>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: "PANIC: could not open critical system index 2662" - twice
Date: 2023-04-06 16:41:56
Message-ID: 01020187577238cf-da8c0f4a-3ab9-445a-8c74-31ef51439f30-000000@eu-west-1.amazonses.com
Lists: pgsql-general

Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with
this error:

2023-04-05 09:24:03.448 UTC [15227] ERROR:  index "pg_class_oid_index"
contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT:  Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437]
root(at)test_behavior_638162834106895162 FATAL:  index "pg_class_oid_index"
contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437]
root(at)test_behavior_638162834106895162 HINT:  Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965]
root(at)test_behavior_638162855458823077 FATAL:  index "pg_class_oid_index"
contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965]
root(at)test_behavior_638162855458823077 HINT:  Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309]
root(at)test_behavior_638162881641031612 PANIC:  could not open critical
system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG:  server process (PID 17309) was
terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG:  terminating any other active
server processes

We had the same thing happen about a month ago on a different database
on the same cluster. For a while PG actually ran OK as long as you
didn't access that specific DB, but when trying to back up that DB with
pg_dump it would crash every time. At that time one of the disks hosting
the ZFS dataset with the PG data directory on it was reporting errors,
so we thought it was likely due to that.

Unfortunately, before we could replace the disks, PG crashed completely
and would not start again at all, so I had to rebuild the cluster from
scratch and restore from pg_dump backups (still onto the old, bad
disks). Once the disks were replaced (all of them) I just copied the
data to them using zfs send | zfs receive and didn't bother restoring
pg_dump backups again - which was perhaps foolish in hindsight.

Well, yesterday it happened again. The server still restarted OK, so I
took fresh pg_dump backups of the databases we care about (which ran
fine), rebuilt the cluster and restored the pg_dump backups again - now
onto the new disks, which are not reporting any problems.

So while everything is up and running now, this error has me rather
concerned. Could the error we're seeing now have been caused by some
corruption in the PG data that's been there for a month (so it could
still be attributed to the bad disk), which should now be fixed by
having restored from backups onto good disks? Could this be a PG bug?
What can I do to figure out why this is happening and prevent it from
happening again? Advice appreciated!
