From: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: emergency outage requiring database restart |
Date: | 2016-10-19 19:39:02 |
Message-ID: | CAHyXU0w2=jfeK0x6kRMqWC0qAL4upWi_cnVdoM12yq9wVXnxkw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>> > Yeah. Believe me -- I know the drill. Most or all the damage seemed
>> > to be to the system catalogs with at least two critical tables dropped
>> > or inaccessible in some fashion. A lot of the OIDs seemed to be
>> > pointing at the wrong thing. Couple more datapoints here.
>> >
>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>> > *) Another database on the same cluster was not impacted. However
>> > it's more olap style and may not have been written to during the
>> > outage
>> >
>> > Now, this infrastructure running this system is running maybe 100ish
>> > postgres clusters and maybe 1000ish sql server instances with
>> > approximately zero unexplained data corruption issues in the 5 years
>> > I've been here. Having said that, this definitely smells and feels
>> > like something on the infrastructure side. I'll follow up if I have
>> > any useful info.
>>
>> After a thorough investigation I now have credible evidence the source
>> of the damage did not originate from the database itself.
>> Specifically, this database is mounted on the same volume as the
>> operating system (I know, I know) and something non database driven
>> sucked up disk space very rapidly and exhausted the volume -- fast
>> enough that sar didn't pick it up. Oh well :-) -- thanks for the help
>
> However, disk space exhaustion should not lead to corruption unless the
> underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the things doing
the things that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

merlin