Re: hung backends stuck in spinlock heavy endless loop

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-22 21:50:03
Message-ID: CAHyXU0x7MPmW1v1kqB5Trb_z0no5w5QpK7_qFo0CYvNngyYsbA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search. If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.

Quick update: not done yet, but I'm making consistent progress, with
several false starts (for example, I had a .conf problem with the
new dynamic shared memory setting and git merrily bisected down to the
introduction of the feature). I have to triple check everything :(.
The problem is generally reproducible but I get false negatives that
throw off the bisection. I estimate that early next week I'll have it
narrowed down significantly if not to the exact offending revision.

So far, the 'nasty' damage seems to generally if not always follow a
checksum failure and the checksum failures are always numerically
adjacent. For example:

[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page
verification failed, calculated checksum 9465 but expected 9477 at
character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING: page
verification failed, calculated checksum 61889 but expected 61903 at
character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING: page
verification failed, calculated checksum 27311 but expected 27316
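
To put a number on "numerically adjacent", here is a minimal sketch that
pulls the calculated/expected pairs out of warnings like the ones above
and reports the difference. The function name and the regular expression
are illustrative assumptions, not part of PostgreSQL; it only measures
the gap and says nothing about the cause.

```python
import re

# Matches the warning text quoted above; the log-line prefix is ignored.
CHECKSUM_RE = re.compile(
    r"page verification failed, calculated checksum (\d+) but expected (\d+)"
)


def checksum_deltas(log_text):
    """Return (calculated, expected, expected - calculated) per failure."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        (int(calc), int(exp), int(exp) - int(calc))
        for calc, exp in CHECKSUM_RE.findall(flat)
    ]
```

Fed the three warnings above, this reports differences of 12, 14 and 5.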

I'm not up on the intricacies of our checksum algorithm but this is
making me suspicious that we are looking at an improperly flipped
visibility bit via some obscure problem -- almost certainly with
vacuum playing a role. This fits the profile of catastrophic damage
that masquerades as numerous other problems. Or, perhaps, something
is flipping what it thinks is a visibility bit but on the wrong page.

I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.

In the 'plus' category, aside from flushing out this issue, I've had
zero runtime problems so far aside from the main problem; bisection
(at least on the 'bad' side) has been reliably engaged by simply
counting the number of warnings/errors/etc in the log. That's really
impressive.
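
For reference, the log-counting check can be sketched as a predicate for
"git bisect run", which treats exit status 0 as good and 1 as bad. This
is a rough sketch under assumptions, not the script actually used: the
severity list, the threshold, and the function names are all
hypothetical.

```python
import re
import sys

# Severity tags to count; this set is an assumption for illustration.
PROBLEM_RE = re.compile(r"\b(WARNING|ERROR|FATAL|PANIC):")


def count_problems(lines):
    """Count log lines carrying one of the severity tags."""
    return sum(1 for line in lines if PROBLEM_RE.search(line))


def bisect_exit_code(lines, threshold=0):
    """Return 1 (bad) if more than `threshold` problems were logged, else 0."""
    return 1 if count_problems(lines) > threshold else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_log.py /path/to/server.log
    with open(sys.argv[1], errors="replace") as fh:
        sys.exit(bisect_exit_code(fh))
```

A "bad" verdict from a check like this is far more trustworthy than a
"good" one, since a clean log may only mean the bug did not trip during
that run.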

merlin
