Re: hung backends stuck in spinlock heavy endless loop

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-16 14:21:28
Message-ID: 54B91E68.7030400@vmware.com
Lists: pgsql-hackers

On 01/16/2015 04:05 PM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> Running this test on another set of hardware to verify -- if this
>>> turns out to be a false alarm which it may very well be, I can only
>>> offer my apologies! I've never had a new drive fail like that, in
>>> that manner. I'll burn the other hardware in overnight and report
>>> back.
>
> huh -- well possibly. not. This is on a virtual machine attached to a
> SAN. It ran clean for several (this is 9.4 vanilla, asserts off,
> checksums on) hours then the starting having issues:
>
> [cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING: page
> verification failed, calculated checksum 59143 but expected 59137 at
> character 20

The calculated checksum is suspiciously close to the expected one. It
could be coincidence, but the previous checksum warning you posted was
also quite close:

> [cds2 18347 2015-01-15 15:58:29.955 CST 1779]WARNING: page
> verification failed, calculated checksum 28520 but expected 28541

I believe the checksum algorithm is supposed to mix the bits quite
thoroughly, so that a difference in a single byte in the input will lead
to a completely different checksum. However, we add the block number to
the checksum last:

>     /* Mix in the block number to detect transposed pages */
>     checksum ^= blkno;
>
>     /*
>      * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
>      * one. That avoids checksums of zero, which seems like a good idea.
>      */
>     return (checksum % 65535) + 1;

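Because the block number is mixed in after everything else, two copies
of the same page stored at different block numbers would get checksums
that differ only by that last step. So it looks very much like a page
has for some reason been moved to a different block number. And that's
exactly what Peter found out in his investigation too; an index page was
mysteriously copied to a different block with identical content.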
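
To illustrate the effect of those last two steps, here is a minimal
sketch. It is not the actual PostgreSQL code: it skips the page-content
mixing entirely and takes an already-mixed 32-bit value (the hypothetical
parameter "mixed") as input. The function names and the example values
are made up for illustration. When the page content is the same and the
two block numbers differ only in their low bits, the results typically
land close together:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: apply only the final two steps quoted above to a
 * value that is assumed to be already mixed from the page contents.
 */
uint16_t
finalize_checksum(uint32_t mixed, uint32_t blkno)
{
    uint32_t checksum = mixed ^ blkno;  /* mix in the block number */

    /* reduce to the range 1..65535, so the result is never zero */
    return (uint16_t) ((checksum % 65535) + 1);
}

/* Absolute distance between two 16-bit checksums. */
int
checksum_distance(uint16_t a, uint16_t b)
{
    return a > b ? a - b : b - a;
}

/*
 * Made-up example: identical page content (same "mixed" value) seen at
 * block 1000 (where it was written) and block 1006 (where it was read).
 */
void
check_example(void)
{
    uint32_t mixed = 0x12345678;
    uint16_t stored = finalize_checksum(mixed, 1000);
    uint16_t calculated = finalize_checksum(mixed, 1006);

    assert(stored == 26565);
    assert(calculated == 26571);
    assert(checksum_distance(stored, calculated) == 6);
}
```

In this made-up case the two checksums differ by 6, the same kind of
small gap as in the warnings above. A difference in the page contents
themselves would change "mixed" and should give an unrelated value.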

- Heikki
