Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-27 18:53:09
Message-ID: CAHyXU0xT6tJhqouu1AAmPvc40jdvK63-OD9_QJdiRR6tbDwVtQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 27, 2016 at 2:31 AM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
>> verification"
>> 2016-10-26 11:26:42 CDT [postgres(at)castaging]: WARNING: page
>> verification failed, calculated checksum 37251 but expected 37244
>> 2016-10-26 11:27:55 CDT [postgres(at)castaging]: WARNING: page
>> verification failed, calculated checksum 37249 but expected 37244
>> 2016-10-26 12:16:44 CDT [postgres(at)castaging]: WARNING: page
>> verification failed, calculated checksum 44363 but expected 44364
>> 2016-10-26 12:18:58 CDT [postgres(at)castaging]: WARNING: page
>> verification failed, calculated checksum 49525 but expected 49539
>> 2016-10-26 12:19:12 CDT [postgres(at)castaging]: WARNING: page
>> verification failed, calculated checksum 37345 but expected 37340
>
> The checksum values are improbably close. The checksum algorithm has
> decently good mixing of all bits in the page. Having the first byte
> match in 5 checksums makes this 1:2^40 improbable. What is not mixed
> in properly is the block number, it only gets xor'ed before packing
> the value into 16bits using modulo 0xFFFF. So I'm pretty sure
> different block numbers were used for writing out and reading in the
> page. Either the blocknum gets corrupted between calculating the
> checksum and writing the page out (unlikely given the proximity), or
> the pages are somehow getting transposed in the storage.
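
To make the point above concrete, here is a minimal sketch of the final folding step as Ants describes it: the block number is xor'ed into the page hash, and the result is packed into 16 bits using modulo 0xFFFF. The function name fold_checksum, the sample hash value, and the block numbers are all hypothetical, and the +1 offset is an assumption based on my recollection of the server's checksum code, not something stated in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: models the last step of the page checksum, not the full hash.
# fold_checksum HASH BLKNO
#   HASH  - stand-in for the well-mixed 32-bit hash of the page contents
#   BLKNO - block number the page is written as (or read as)
fold_checksum() {
    local hash=$1 blkno=$2
    # The block number is only xor'ed in, so it is not mixed like page bits.
    # The "+ 1" (keeping the result nonzero) is an assumption, see above.
    echo $(( ((hash ^ blkno) % 65535) + 1 ))
}

page_hash=$(( 0x9E3779B1 ))   # hypothetical hash value, not from the logs

# Same page contents, but checksummed under two nearby block numbers.
written=$(fold_checksum "$page_hash" 1000)
read_as=$(fold_checksum "$page_hash" 1003)

echo "written=$written read_as=$read_as"
```

With identical page contents, two block numbers that differ only in their low bits give folded values that differ only slightly, which is the kind of near miss seen in the log lines quoted above.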

I think we can rule out faulty storage, and I'm reasonably sure
nothing is writing to the database cluster except postgres itself.

So far I have not executed an emergency dump/restore on this database,
nor have I received any other log entries suggesting problems of any
kind. Meaning, FWICT, the database is completely healthy
notwithstanding the reported errors.

This suggests (but does not prove) that the syscatalog-damaging
activity happened during the execution of the attached function and
healed itself during the rollback. Prior to turning on checksums,
the transaction did *not* roll back and the damage was allowed to stand
and probably accumulated.

As noted earlier, I was not able to reproduce the issue with
crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test
<snip>

(do_parallel is a simple wrapper that executes the command in parallel,
up to NUM_FORKS). This is on the same server and cluster as above.
This kind of suggests that either
A) there is some concurrent activity from another process that is
tripping the issue
or
B) there is something particular to the session invoking the function
that is participating in the problem. As the application is
structured, a single-threaded node.js app is issuing the query that is
high traffic and long lived. It's still running in fact and I'm kind
of tempted to find some downtime to see if I can still reproduce via
the UI.
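
For anyone wanting to try something similar, a wrapper like do_parallel could be sketched roughly as below. This is not the actual crashme.sh helper (which is not shown here); it assumes "in parallel up to NUM_FORKS" means launching NUM_FORKS concurrent copies of the command and waiting for all of them.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of a do_parallel-style helper.
NUM_FORKS=${NUM_FORKS:-16}

# do_parallel CMD [ARGS...]
# Starts NUM_FORKS background copies of the command, waits for all of
# them, and returns nonzero if any copy exited with a failure status.
do_parallel() {
    local i pid failures=0
    local pids=()
    for (( i = 0; i < NUM_FORKS; i++ )); do
        "$@" &              # run one copy in the background
        pids+=("$!")        # remember its PID so we can wait on it
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failures=$(( failures + 1 ))
    done
    [ "$failures" -eq 0 ]
}
```

Called as in the script above, for example do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test, each invocation would open NUM_FORKS concurrent sessions running the same function.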

merlin