Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Oskari Saarenmaa <os(at)ohmu(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2017-08-10 20:02:03
Message-ID: CAHyXU0zk7KpHARJ+ErqnxD+6-kBnnyYb8dnUEpHESwKjGWvd=Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>> Still getting checksum failures. Over the last 30 days, I see the
>>>> following. Since enabling checksums FWICT none of the damage is
>>>> permanent and rolls back with the transaction. So creepy!
>>>
>>> The checksums still only differ in least significant digits which
>>> pretty much means that there is a block number mismatch. So if you
>>> rule out filesystem not doing its job correctly and transposing
>>> blocks, it could be something else that is resulting in blocks getting
>>> read from a location that happens to differ by a small multiple of
>>> page size. Maybe somebody is racily mucking with table fd's between
>>> seeking and reading. That would explain the issue disappearing after a
>>> retry.
>>>
>>> Maybe you can arrange for the RelFileNode and block number to be
>>> logged for the checksum failures and check what the actual checksums
>>> are in data files surrounding the failed page. If the requested block
>>> number contains something completely else, but the page that follows
>>> contains the expected checksum value, then it would support this
>>> theory.
>>
>> will do. Main challenge is getting hand compiled server to swap in
>> so that libdir continues to work. Getting access to the server is
>> difficult as is getting a maintenance window. I'll post back ASAP.
>
> As a new datapoint, we just had a customer with an issue that I think
> might be related. The issue was reasonably repeatable by running a
> report on the standby system. Issue manifested itself by first "could
> not open relation" and/or "column is not in index" errors, followed a
> few minutes later by a PANIC from startup process due to "specified
> item offset is too large", "invalid max offset number" or "page X of
> relation base/16384/1259 is uninitialized". I took a look at the xlog
> dump and it was completely fine. For instance in the "specified item
> offset is too large" case there was a INSERT_LEAF redo record
> inserting the preceding offset just a couple hundred kilobytes back.
> Restarting the server sometimes successfully applied the offending
> WAL, sometimes it failed with other corruption errors. The offending
> relations were always pg_class or pg_class_oid_index. Replacing plsh
> functions with dummy plpgsql functions made the problem go away,
> reintroducing plsh functions made it reappear.

Fantastic. I was never able to apply the O_CLOEXEC patch (see
upthread) because access to the system is highly limited
and compiling a replacement binary was a bit of a headache. IIRC this
was the best theory on the table as to the underlying cause, and we
ought to try that first, right?
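
For anyone following along, here is a minimal sketch of the mechanism the O_CLOEXEC theory rests on. It is not the patch from upthread and it does not touch PostgreSQL; it only shows, under the assumption of a POSIX system with Python 3, that a descriptor opened with O_CLOEXEC is not visible to an exec'd child (such as a shell started by a plsh function), while the same descriptor with the flag cleared is. The helper names child_sees_fd and demo are made up for illustration.

```python
import os
import subprocess
import sys
import tempfile


def child_sees_fd(fd):
    """Return True if a freshly exec'd child inherits `fd` pointing at the same file."""
    st = os.fstat(fd)
    # The child checks that the descriptor number refers to the same inode,
    # so an unrelated descriptor with the same number cannot fool the probe.
    probe = (
        "import os, sys\n"
        "try:\n"
        "    st = os.fstat(int(sys.argv[1]))\n"
        "except OSError:\n"
        "    sys.exit(1)\n"
        "same = (st.st_ino == int(sys.argv[2]) and st.st_dev == int(sys.argv[3]))\n"
        "sys.exit(0 if same else 1)\n"
    )
    # close_fds=False: only the close-on-exec flag decides what the child gets.
    result = subprocess.run(
        [sys.executable, "-c", probe, str(fd), str(st.st_ino), str(st.st_dev)],
        close_fds=False,
    )
    return result.returncode == 0


def demo():
    """Open a scratch file with and without close-on-exec and probe a child each time."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDONLY | os.O_CLOEXEC)
        try:
            with_cloexec = child_sees_fd(fd)
            os.set_inheritable(fd, True)  # clears FD_CLOEXEC on the descriptor
            without_cloexec = child_sees_fd(fd)
        finally:
            os.close(fd)
    return with_cloexec, without_cloexec


if __name__ == "__main__":
    print(demo())
```

A child that inherits a descriptor shares the file offset with the parent, which is why a leaked table fd could plausibly produce reads from the wrong block; whether that is what actually happens here is still only a theory.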

Reminder: I was able to completely eliminate all damage (but had to
handle occasional unexpected rollbacks) by enabling checksums.

merlin
