Re: emergency outage requiring database restart

From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Oskari Saarenmaa <os(at)ohmu(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2017-08-10 17:01:45
Message-ID: CA+CSw_seDunLPXqczV_5NO1YaOq-89r0fqsCX7zsEba8cmyeOg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> Still getting checksum failures. Over the last 30 days, I see the
>>> following. Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction. So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do. Main challenge is getting hand compiled server to swap in
> so that libdir continues to work. Getting access to the server is
> difficult as is getting a maintenance window. I'll post back ASAP.
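
To make the check above concrete, below is roughly what I have in mind
once a failing RelFileNode and block number are known (an entirely
untested sketch; build it against the server headers, e.g. via PGXS, and
treat the file name handling and the +/-2 neighborhood as placeholders).
It relies on the block number being xor'ed into the checksum before it
is folded down to 16 bits, so a page whose contents actually belong a
few blocks away should verify cleanly against one of the neighboring
block numbers instead of its own.

/*
 * neighbor_check.c -- untested sketch. Reads one block from a relation
 * segment file and recomputes its checksum the same way the server does,
 * both for the block number we asked for and for a few neighboring block
 * numbers. If the stored pd_checksum matches a *neighboring* number, the
 * page contents most likely came from the wrong offset in the file.
 */
#include "postgres_fe.h"

#ifndef Assert                  /* some older branches lack a frontend Assert */
#define Assert(x)
#endif
#include "storage/bufpage.h"
#include "storage/checksum_impl.h"

int
main(int argc, char **argv)
{
    /* union forces the alignment the checksum code expects */
    union
    {
        char    data[BLCKSZ];
        double  force_align;
    }           buf;
    FILE       *f;
    long        blkno;          /* block number within the relation */
    int         delta;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s segment-file relation-block-number\n", argv[0]);
        return 1;
    }
    blkno = atol(argv[2]);

    f = fopen(argv[1], "rb");
    if (f == NULL ||
        fseek(f, (blkno % RELSEG_SIZE) * (long) BLCKSZ, SEEK_SET) != 0 ||
        fread(buf.data, 1, BLCKSZ, f) != BLCKSZ)
    {
        fprintf(stderr, "could not read block %ld from %s\n", blkno, argv[1]);
        return 1;
    }

    if (PageIsNew(buf.data))
    {
        printf("block %ld is new/zeroed, nothing to verify\n", blkno);
        return 0;
    }

    printf("stored pd_checksum: %u\n", ((PageHeader) buf.data)->pd_checksum);
    for (delta = -2; delta <= 2; delta++)
    {
        if (blkno + delta < 0)
            continue;
        printf("checksum as if this were block %ld: %u\n", blkno + delta,
               pg_checksum_page(buf.data, (BlockNumber) (blkno + delta)));
    }
    fclose(f);
    return 0;
}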

As a new data point, we just had a customer with an issue that I think
might be related. The issue was reasonably repeatable by running a
report on the standby system. It manifested itself first as "could
not open relation" and/or "column is not in index" errors, followed a
few minutes later by a PANIC from the startup process due to "specified
item offset is too large", "invalid max offset number" or "page X of
relation base/16384/1259 is uninitialized". I took a look at the xlog
dump and it was completely fine. For instance, in the "specified item
offset is too large" case there was an INSERT_LEAF redo record
inserting the preceding offset just a couple hundred kilobytes back.
Restarting the server sometimes applied the offending WAL successfully,
sometimes it failed with other corruption errors. The offending
relations were always pg_class or pg_class_oid_index. Replacing the
plsh functions with dummy plpgsql functions made the problem go away;
reintroducing the plsh functions made it reappear.

The only explanation I came up with that is consistent with the
symptoms is that a page got thrown out of shared_buffers between the
two xlog records referencing it (shared_buffers was at the default
128MB) and was then read back by a backend process, where, in the
window between the FileSeek and FileRead calls in mdread, a subprocess
mucked with the fd's offset so that a different page than intended got
read in. Or basically the same race condition, but on the write side.
Maybe somebody else has a better imagination than I do...
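
To illustrate the mechanism I'm suspecting, here is a plain POSIX
sketch (untested, nothing PostgreSQL-specific in it): after fork() the
child shares the parent's open file description, so a seek done by the
child lands the parent's next read() on a different block, while
pread() is immune because it carries the offset in the call itself. In
the real case the ordering obviously isn't forced like this; the child
would just have to touch an inherited data file fd in the window
between FileSeek and FileRead, which would also explain why the
failures are rare and go away on retry.

/*
 * offset_race.c -- untested sketch of the suspected seek/read race.
 * The parent positions the fd at block 0 the way lseek()+read() in
 * mdread() effectively does; a forked child that inherited the fd moves
 * the shared offset in between, so the parent's read() returns block 1.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
    char    buf[BLKSZ];
    int     fd;
    pid_t   child;

    /* scratch file (name is arbitrary): block 0 filled with 'A', block 1 with 'B' */
    fd = open("racetest.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
    memset(buf, 'A', BLKSZ);
    write(fd, buf, BLKSZ);
    memset(buf, 'B', BLKSZ);
    write(fd, buf, BLKSZ);

    /* "FileSeek": position at block 0 before reading it */
    lseek(fd, 0, SEEK_SET);

    child = fork();
    if (child == 0)
    {
        /*
         * The child shares the open file description with the parent,
         * so this seek moves the parent's position as well.
         */
        lseek(fd, BLKSZ, SEEK_SET);
        _exit(0);
    }
    waitpid(child, NULL, 0);

    /* "FileRead": the parent thinks it is still at block 0 */
    read(fd, buf, BLKSZ);
    printf("read() after the child's seek saw block '%c'\n", buf[0]);  /* 'B' */

    /* pread() specifies the offset explicitly and is unaffected */
    pread(fd, buf, BLKSZ, 0);
    printf("pread() of block 0 saw block '%c'\n", buf[0]);             /* 'A' */

    return 0;
}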

Regards,
Ants Aasma
