Re: incorrect resource manager data checksum in record

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Devin Christensen <quixoten(at)gmail(dot)com>
Cc:	"pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject:	Re: incorrect resource manager data checksum in record
Date:	2018-06-28 22:13:05
Message-ID:	CAEepm=0WPQgzt9yNHYJxJKp_qgk-jr-JX4u2395K7So3xGMbhA@mail.gmail.com
Lists:	pgsql-general

On Fri, Jun 29, 2018 at 5:44 AM, Devin Christensen <quixoten(at)gmail(dot)com>
wrote:
> The pattern is the same, regardless of ubuntu or postgresql versions. I'm
> concerned this is somehow a ZFS corruption bug, because the error always
> occurs downstream of the first ZFS node and ZFS is a recent addition. I
> don't know enough about what this error means, and haven't found much
> online. When I restart the nodes affected, replication resumes normally,
> with no known side-effects that I've discovered so far, but I'm no longer
> confident that the data downstream from the primary is valid. Really not
> sure how best to start tackling this issue, and hoping to get some
> guidance.
> The error is infrequent. We have 11 total replication chains, and this
> error has occurred on 5 of those chains in approximately 2 months.

It's possible and sometimes expected to see that error when there has been
a crash, but you didn't mention that. From your description it sounds like
it's happening in the middle of streaming, right? My first thought was
that the filesystem change is surely a red herring. But... I did find this
similar complaint that involves an ext4 primary and a btrfs replica:

https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6
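
For context on what the message itself reports: each WAL record carries a CRC, and the reader recomputes it and compares it with the stored value; a mismatch is what gets logged as "incorrect resource manager data checksum". The sketch below only illustrates that idea. It assumes the CRC-32C (Castagnoli) polynomial, which as far as I know is what WAL records have used since PostgreSQL 9.5, and it uses a hypothetical record_is_valid() helper over a bare payload rather than the real record header layout.

```python
# Illustrative only: a toy "record" is a payload plus a stored CRC-32C.
# The real WAL record format and the order in which PostgreSQL feeds
# bytes into the CRC are NOT reproduced here.

def _make_table():
    # Bytewise lookup table for the reflected CRC-32C polynomial.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc32c(data: bytes) -> int:
    """Compute CRC-32C over data (pure Python, table driven)."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def record_is_valid(payload: bytes, stored_crc: int) -> bool:
    """Hypothetical helper: True if the payload still matches its CRC."""
    return crc32c(payload) == stored_crc

if __name__ == "__main__":
    payload = b"example record payload"
    stored = crc32c(payload)
    # Flip one bit, as a bad page, bad RAM, or a torn write might.
    damaged = bytes([payload[0] ^ 0x01]) + payload[1:]
    print(record_is_valid(payload, stored))
    print(record_is_valid(damaged, stored))
```

A failed check only says that the bytes read differ from the bytes that were checksummed; it does not say whether the damage happened on disk, in memory, or in transit.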

I'm having trouble imagining how the filesystem could be triggering a
problem though (unless ZFS on Linux is dramatically less stable than ZFS on
other operating systems, "ZFS ate my bytes" seems like a super unlikely
theory).
Perhaps by being slower, it triggers a bug elsewhere? We did have a report
recently of ZFS recycling WAL files very slowly (presumably because when it
moves the old file to become the new file, it finishes up slurping it back
into memory even though we're just going to overwrite it, and it can't see
that because our writes don't line up with the ZFS record size, possibly
unlike ye olde write-in-place 4k block filesystems, but that's just my
guess). Does your machine have ECC RAM?
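
The alignment part of that guess can be made concrete with a little arithmetic. This is a rough sketch, not a model of what ZFS actually does: it assumes 8 kB WAL writes and a 128 kB ZFS recordsize (common defaults, but not stated anywhere in this thread), and records_needing_read() is a hypothetical helper that just lists which filesystem records a write covers only partially. A partially covered record is one the filesystem cannot replace without knowing its old contents.

```python
def records_needing_read(offset, length, record_size):
    """Return indexes of records only partly covered by the write
    [offset, offset + length). Hypothetical helper for illustration."""
    if length <= 0:
        return []
    end = offset + length
    first = offset // record_size
    last = (end - 1) // record_size
    partial = []
    for idx in range(first, last + 1):
        rec_start = idx * record_size
        rec_end = rec_start + record_size
        # The write misses some bytes at the start or end of this record.
        if offset > rec_start or end < rec_end:
            partial.append(idx)
    return partial

if __name__ == "__main__":
    WAL_WRITE = 8 * 1024       # assumed WAL write size
    ZFS_RECORD = 128 * 1024    # assumed ZFS recordsize
    OLD_BLOCK = 4 * 1024       # "ye olde" 4k block filesystem

    # An 8 kB write into a 128 kB record leaves most of the record untouched.
    print(records_needing_read(0, WAL_WRITE, ZFS_RECORD))
    # The same write covers two 4 kB blocks completely.
    print(records_needing_read(0, WAL_WRITE, OLD_BLOCK))
```

Under these assumptions every 8 kB write into a recycled segment touches a record it does not fully overwrite, while on a 4 kB block filesystem the same write replaces whole blocks.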

--
Thomas Munro
http://www.enterprisedb.com

In response to

incorrect resource manager data checksum in record at 2018-06-28 17:44:03 from Devin Christensen

Responses

Re: incorrect resource manager data checksum in record at 2018-06-29 01:14:05 from Devin Christensen
pgloader question - postgis support at 2018-06-29 02:33:19 from Brent Wood
