Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-19 13:54:48
Message-ID: CAHyXU0zCezq3Zq63GEvDYebW6j8tXoKM4mk54d3jSrQDzyDMNA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
>> Merlin Moncure wrote:
>>
>>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>>> castaging-# SELECT ...
>>> ERROR: 42809: "pg_cast_oid_index" is an index
>>> LINE 11: FROM ApartmentSample s
>>> ^
>>> LOCATION: heap_openrv_extended, heapam.c:1304
>>>
>>> should I be restoring from backups?
>>
>> It's pretty clear to me that you've got catalog corruption here. You
>> can try to fix things manually as they emerge, but that sounds like a
>> fool's errand.
>
> Yeah. Believe me -- I know the drill. Most or all the damage seemed
> to be to the system catalogs with at least two critical tables dropped
> or inaccessible in some fashion. A lot of the OIDs seemed to be
> pointing at the wrong thing. Couple more datapoints here.
>
> *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> *) Another database on the same cluster was not impacted. However
> it's more olap style and may not have been written to during the
> outage
>
> Now, this infrastructure running this system is running maybe 100ish
> postgres clusters and maybe 1000ish sql server instances with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here. Having said that, this definitely smells and feels
> like something on the infrastructure side. I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence that the
damage did not originate from the database itself. Specifically, this
database is mounted on the same volume as the operating system (I
know, I know) and something non-database-driven sucked up disk space
very rapidly and exhausted the volume -- fast enough that sar didn't
pick it up. Oh well :-) -- thanks for the help.

merlin
