Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-28 21:17:36
Message-ID: CAHyXU0wPDtqgehcJc8PLM8daD6sPAqwy7McMf5OmHxOoaO=iVw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>>
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>>>
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
>>> wrote:
>>>>
>>>> I think we can rule out faulty storage
>>>
>>>
>>> Nobody ever expects the faulty storage
>
>
> LOL
>
>> Believe me, I know. But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.

Upthread, you might have noticed that I already did that (what "safe
defaults" looks like here is sketched further down). Here is the
other evidence:
*) server running fine for 5+ years
*) another database on the same cluster, with 10x the write activity,
is not impacted
*) no interesting logs in /var/log/messages, dmesg, etc
*) the san fabric turns over petabytes/day with no corruption, across
100+ postgres clusters and 1000+ sql server clusters (and that's just
the non-production side)
*) storage/network teams have been through everything; nothing
interesting/unusual to report
*) we have an infrequently run routine (posted upthread) that, when
run, crashed the database within minutes
*) after turning on checksums, 30% of invocations of the routine
resulted in a checksum error (a quick way to verify checksums are on
is sketched after this list)
*) the problem recurred after a dump/restore and full cluster rebuild
*) the checksum error rolled the routine back; from what I can tell
this prevented the damage
*) everything is fine now that the routine is not being run anymore
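
For anyone double-checking the checksum point: confirming that
checksums are actually enabled on the rebuilt cluster is quick.
Roughly this, using stock tools (the SHOW variant assumes a release
new enough to expose data_checksums):

    # both should report checksums as enabled for the data directory
    pg_controldata "$PGDATA" | grep -i checksum
    psql -Atc "show data_checksums"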

You can come up with your own conclusion; I've come up with mine. The
only frustrating thing here is that I can't reproduce this outside of
the production environment. If this database goes down I have 30
people sitting around, so I can't take downtime lightly.
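
To be concrete about "safe defaults" above, this is the shape of the
storage-related settings after the reset -- a sketch, not the literal
production file:

    # postgresql.conf -- conservative storage settings
    fsync = on
    full_page_writes = on
    synchronous_commit = on
    # filesystem side: plain mount options -- no nobarrier, no
    # writeback journaling, nothing exotic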

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of etl to/from sql server and it's all sqsh based (a rough
take on your named-pipe idea is sketched below). If I can figure out
how to reproduce this in a better way, I'll zero in on the problem in
about 10 minutes.
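
Something along these lines is roughly what I'd try for the watcher
you describe -- an untested sketch, with the pipe path, server name,
credentials, and the ".done" completion marker all placeholders:

    #!/bin/bash
    # stand-alone watcher: read a SQL file name from a named pipe,
    # run it through sqsh, then drop a .done marker to signal completion
    PIPE=/var/run/etl/sqsh_requests.fifo   # hypothetical location
    [ -p "$PIPE" ] || mkfifo "$PIPE"

    while true; do
        if read -r sqlfile < "$PIPE"; then
            [ -f "$sqlfile" ] || continue
            sqsh -S target_server -U etl_user -P "$ETL_PASSWORD" \
                 -i "$sqlfile" > "$sqlfile.out" 2>&1
            touch "$sqlfile.done"           # caller polls for this
        fi
    done

The plsh function would then just write the file name into the pipe
and wait for the marker, so sqsh never runs inside the backend
process at all.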

merlin
