Re: buffer assertion tripping under repeat pgbench load

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: buffer assertion tripping under repeat pgbench load
Date: 2012-12-27 03:17:24
Message-ID: 29851.1356578244@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> On Wed, Dec 26, 2012 at 11:47 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> It would be nice if this were just something like a memory issue on this
>> system. That I'm getting the same very odd value every time--this refcount
>> of 1073741824--makes it seem less random than I expect from bad memory.
>> Once I get a few more crash samples (with buffer ids) I'll shut the system
>> down for a pass of memtest86+.

> Well, that's a one-bit error, and it would never get detected until the
> value was decremented down to what should be zero, so that's pretty
> much exactly what I would expect to see from a memory or CPU error.

Yeah, the fact that it's always the same bit makes it seem like it could
be one bad physical bit. (Does this machine have ECC memory?)
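
For reference, the "one-bit" reading of the reported refcount can be checked mechanically: 1073741824 is 2^30, so exactly one bit (bit 30) is set. A minimal sketch follows; the helper names are made up for illustration and are not PostgreSQL functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit of v is set, i.e. v is a power of two. */
static bool is_single_bit(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Index of the lowest set bit, or -1 if v is zero. */
static int lowest_set_bit(uint32_t v)
{
    int i;

    for (i = 0; i < 32; i++)
        if (v & (UINT32_C(1) << i))
            return i;
    return -1;
}
```

Calling `is_single_bit(1073741824u)` returns true and `lowest_set_bit(1073741824u)` returns 30, which is what makes the value look like a single flipped bit on top of a count of zero.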

The thing that this theory has a hard time with is that the buffer's
global refcount is zero. If you assume that there's a bit that
sometimes randomly goes to 1 when it should be 0, then what I'd expect
to typically happen is that UnpinBuffer sees nonzero LocalRefCount and
hence doesn't drop the session's global pin when it should. The only
way that doesn't happen is if decrementing LocalRefCount to zero stores
a nonzero pattern when it should store zero, but nonetheless the CPU
thinks it stored zero. As you say there's some small possibility of a
CPU glitch doing that, but then why is it only happening to
LocalRefCount and not any other similar coding?
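
The argument above can be sketched with a toy model. This is an assumption-laden simplification, not the actual bufmgr code: `ToyBuffer`, `toy_pin`, `toy_unpin` and `flip_then_unpin` are hypothetical names, and the only rule modelled is that the shared pin is taken when the session's local count goes from 0 to 1 and dropped when it returns to 0.

```c
#include <stdint.h>

/* Toy stand-in for one buffer as seen by one session (hypothetical). */
typedef struct
{
    int32_t shared_refcount;   /* the buffer's global refcount */
    int32_t local_refcount;    /* this session's private pin count */
} ToyBuffer;

/* Take the shared pin only on the session's first local pin. */
static void toy_pin(ToyBuffer *b)
{
    if (b->local_refcount == 0)
        b->shared_refcount++;
    b->local_refcount++;
}

/* Drop the shared pin only when the local count reaches zero. */
static void toy_unpin(ToyBuffer *b)
{
    b->local_refcount--;
    if (b->local_refcount == 0)
        b->shared_refcount--;
}

/* Pin once, let bit 30 flip while the pin is held, then unpin. */
static ToyBuffer flip_then_unpin(void)
{
    ToyBuffer b = {0, 0};

    toy_pin(&b);
    b.local_refcount |= INT32_C(1) << 30;   /* simulated stray bit */
    toy_unpin(&b);
    return b;
}
```

In this model the flipped bit leaves the local count at 1073741824, but the shared refcount stays at 1 because the unpin never sees zero. That is the leaked global pin one would expect from a randomly flipped bit, and it is not what the crash shows (global refcount zero).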

At the moment I like the other theory you alluded to, that this is a
wild store from code that thinks it's manipulating some other data
structure entirely. The buffer IDs may help confirm or refute that.
No idea ATM how we would find the problem if it's like that
...
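
To make the wild-store idea concrete, here is a purely hypothetical sketch; nothing in it corresponds to real PostgreSQL data structures. Two logical arrays share one flat arena, and code that believes it is setting a flag in its own array uses an out-of-range index and lands on a neighbouring refcount slot instead.

```c
#include <stdint.h>

#define OTHER_SLOTS 4   /* slots owned by some unrelated structure */
#define NBUFFERS    4   /* per-buffer local refcount slots */

/* One flat arena, so the stray write stays inside a single C object. */
typedef struct
{
    int32_t cells[OTHER_SLOTS + NBUFFERS];
} ToyArena;

/* Accessor for the unrelated structure; the missing bounds check is the bug. */
static int32_t *other_slot(ToyArena *a, int i)
{
    return &a->cells[i];
}

/* Accessor for the local refcount of buffer number buf. */
static int32_t *local_refcount(ToyArena *a, int buf)
{
    return &a->cells[OTHER_SLOTS + buf];
}

static ToyArena wild_store_demo(void)
{
    ToyArena a = {{0}};

    /* The session pins and unpins buffer 1 cleanly: its count is 0 again. */
    ++*local_refcount(&a, 1);
    --*local_refcount(&a, 1);

    /* Unrelated code sets bit 30 in "its" slot, using a bad index. */
    *other_slot(&a, OTHER_SLOTS + 1) |= INT32_C(1) << 30;
    return a;
}
```

Here the refcount for buffer 1 ends up as 1073741824 even though every pin was released properly, which would be consistent with a global refcount of zero. If the real bug were of this kind, the set of affected buffer IDs might hint at which neighbouring structure is responsible.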

regards, tom lane
