Re: regression test failed when enabling checksum

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: regression test failed when enabling checksum
Date: 2013-04-03 16:48:54
Message-ID: CAMkU=1x=261iP1rJz8Z1YJBqnnNUGtJ9yMUaLcQqxKkVKu8iDg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 3, 2013 at 2:31 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> I just checked and unfortunately your dump doesn't contain all that much
> valid WAL:
> ...
>
> So just two checkpoint records.
>
> Unfortunately I fear that won't be enough to diagnose the problem,
> could you reproduce it with a higher wal_keep_segments?
>

I've been trying, but see message "commit dfda6ebaec67 versus
wal_keep_segments".

Looking at some of the log files more, I see that vacuum is involved, but
in some way I don't understand. The crash always happens on a test
cycle immediately after the sleep that allows the autovac to kick in and
finish. So the sequence of events goes something like this:

...
run the frantic updating of "foo" until crash
recovery
query "foo" and verify the results are consistent with expectations
sleep to allow autovac to do its job.
truncate "foo" and repopulate it.
run the frantic updating of "foo" until crash
recovery
attempt to query "foo" but get the checksum failure.
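
For illustration only, the cycle above could be sketched roughly as follows.
This is not the actual harness: every callable passed in is a hypothetical
placeholder for one of the steps listed, and the failing check is modelled
simply as the verification step returning False.

```python
from typing import Callable


def run_cycles(
    update_until_crash: Callable[[], None],
    wait_for_recovery: Callable[[], None],
    verify_foo: Callable[[], bool],
    sleep_for_autovac: Callable[[], None],
    truncate_and_repopulate: Callable[[], None],
    max_cycles: int,
) -> int:
    """Run crash/recovery cycles.

    Returns the 1-based number of the cycle whose verification failed,
    or 0 if every cycle passed. All callables are hypothetical stand-ins
    for the harness steps described in the text.
    """
    for cycle in range(1, max_cycles + 1):
        update_until_crash()       # frantic updating of "foo" until crash
        wait_for_recovery()        # let the server complete crash recovery
        if not verify_foo():       # query "foo"; False models the checksum failure
            return cycle
        sleep_for_autovac()        # give autovac time to kick in and finish
        truncate_and_repopulate()  # truncate "foo" and repopulate it
    return 0
```

In this sketch the failure described above would show up as a return value
one higher than the last cycle that slept for autovac.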

What the vacuum is doing that corrupts the system in a way that survives
the truncate is a mystery to me.

Also, at one point I had the harness itself exit as soon as it detected the
problem, but I failed to have it shut down the server. So the server kept
running idle and having autovac do its thing, which produced some
interesting log output:

WARNING: relation "foo" page 45 is uninitialized --- fixing
WARNING: relation "foo" page 46 is uninitialized --- fixing
...
WARNING: relation "foo" page 72 is uninitialized --- fixing
WARNING: relation "foo" page 73 is uninitialized --- fixing
WARNING: page verification failed, calculated checksum 54570 but expected 34212
ERROR: invalid page in block 74 of relation base/16384/4931589

This happened 3 times. Every time, the warnings started on page 45, and
they continued up until the invalid page was found (which varied, being 74,
86, and 74 again).
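
To compare runs like these, a small script can pull the page numbers out of
the server log. The following is only a sketch: the regular expressions are
written against the message wording in the excerpt above, and the function
name is made up for illustration.

```python
import re
from typing import List, Optional, Tuple

# Patterns assume the message wording shown in the log excerpt.
UNINIT_RE = re.compile(r'relation "[^"]+" page (\d+) is uninitialized')
INVALID_RE = re.compile(r'invalid page in block (\d+) of relation \S+')


def summarize_log(lines: List[str]) -> Tuple[List[int], Optional[int]]:
    """Return (uninitialized page numbers, first invalid block or None)."""
    uninitialized: List[int] = []
    invalid_block: Optional[int] = None
    for line in lines:
        match = UNINIT_RE.search(line)
        if match:
            uninitialized.append(int(match.group(1)))
            continue
        match = INVALID_RE.search(line)
        if match and invalid_block is None:
            invalid_block = int(match.group(1))
    return uninitialized, invalid_block
```

Fed the log lines of each run, it gives the first uninitialized page and the
block that failed verification, which makes it easy to see whether the
warnings always start at the same page.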

I wonder if the bug is in checksums, or if the checksums are doing their
job by finding some other bug. And why did those uninitialized pages
trigger warnings when they were autovacced, but not when they were seq
scanned in a query?

Cheers,

Jeff
