Re: buffer assertion tripping under repeat pgbench load

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: buffer assertion tripping under repeat pgbench load
Date: 2012-12-30 03:07:45
Message-ID: 50DFB001.7010000@2ndQuadrant.com
Lists: pgsql-hackers

On 12/27/12 7:43 AM, Greg Stark wrote:
> If it's always the first buffer then it could conceivably still be
> some other heap allocated object that always lands before
> LocalRefCount. It does seem a bit weird to be storing 1<<30 though --
> there are no 1<<30 constants that we might be storing for example.

It is a strange power of two to see there. I can follow your
reasoning for why this could be a bit-flipping error. There's no sign
of that elsewhere, though: no other crashes under load. I'm using this
server precisely because it has worked fine for a while now.

I added printing the buffer number, and they're all over the place:

2012-12-27 06:36:39 EST [26306]: WARNING: refcount of buf 29270 containing base/16384/90124 blockNum=82884, flags=0x127 is 1073741824 should be 0, globally: 0
2012-12-27 02:08:19 EST [21719]: WARNING: refcount of buf 114262 containing base/16384/81932 blockNum=133333, flags=0x106 is 1073741824 should be 0, globally: 0
2012-12-26 20:03:05 EST [15117]: WARNING: refcount of buf 142934 containing base/16384/73740 blockNum=87961, flags=0x127 is 1073741824 should be 0, globally: 0

The affected relation continues to bounce between pgbench_accounts and
its primary key; I can't see a pattern there either. To answer a few
other questions: this system does not have ECC RAM. It did survive many
passes of memtest86+ without any problems though, right after the above.

I tried duplicating the problem on a similar server. That one keeps
hanging due to some Linux software RAID bug before it runs for very long.
Whatever is going on here, it really doesn't want to be discovered.

For reference, the debugging code those latest messages came from
is now:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..60d3ad3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,27 @@ AtEOXact_Buffers(bool isCommit)
     if (assert_enabled)
     {
         int         i;
+        int         RefCountErrors = 0;
 
         for (i = 0; i < NBuffers; i++)
         {
-            Assert(PrivateRefCount[i] == 0);
+
+            if (PrivateRefCount[i] != 0)
+            {
+                /*
+                PrintBufferLeakWarning(&BufferDescriptors[i]);
+                */
+                BufferDesc *bufHdr = &BufferDescriptors[i];
+                elog(WARNING,
+                    "refcount of buf %d containing %s blockNum=%u, flags=0x%x is %u should be 0, globally: %u",
+                    i,relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+                    bufHdr->tag.blockNum, bufHdr->flags, PrivateRefCount[i], bufHdr->refcount);
+                RefCountErrors++;
+            }
         }
+        if (RefCountErrors > 0)
+            elog(WARNING, "buffers with non-zero refcount is %d", RefCountErrors);
+        Assert(RefCountErrors == 0);
     }
 #endif
