Re: emergency outage requiring database restart

From: Andres Freund <andres(at)anarazel(dot)de>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>,Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>,Bruce Momjian <bruce(at)momjian(dot)us>,PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-26 18:34:30
Message-ID: 147977C4-6107-47C6-9628-475EC6263E2C@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On October 26, 2016 8:57:22 PM GMT+03:00, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
>wrote:
>> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure(at)gmail(dot)com>
>wrote:
>>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
>wrote:
>>>> Confirmation of problem re-occurrence will come in a few days.
>I'm
>>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>>> etc) should it prove the problem goes away post rebuild.
>>>
>>> ok, no major reported outage yet, but just got:
>>>
>>> 2016-10-26 11:27:55 CDT [postgres(at)castaging]: ERROR: invalid page
>in
>>> block 12 of relation base/203883/1259
>
>*) I've now strongly correlated this routine with the damage.
>
>[root(at)rcdylsdbmpf001 ~]# cat
>/var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep -i
>pushmarketsample | head -5
>2016-10-26 11:26:27 CDT [postgres(at)castaging]: LOG: execute <unnamed>:
>SELECT PushMarketSample($1::TEXT) AS published
>2016-10-26 11:26:40 CDT [postgres(at)castaging]: LOG: execute <unnamed>:
>SELECT PushMarketSample($1::TEXT) AS published
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL
>statement
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL
>statement
>2016-10-26 11:26:42 CDT [postgres(at)castaging]: STATEMENT: SELECT
>PushMarketSample($1::TEXT) AS published
>
>*) First invocation was 11:26:27 CDT
>
>*) Second invocation was 11:26:40 and gave checksum error (as noted
>earlier 11:26:42)
>
>*) Routine attached (if interested)
>
>My next step is to set up test environment and jam this routine
>aggressively to see what happens.

Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly one way to get into odd corruption issues.

We processor really should use O_CLOEXEC for the majority of it file handles.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2016-10-26 18:38:49 Re: emergency outage requiring database restart
Previous Message Robert Haas 2016-10-26 18:33:42 Re: Issues with building snap packages and psql