Re: Recovery inconsistencies, standby much larger than primary

From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: Recovery inconsistencies, standby much larger than primary
Date: 2014-01-31 20:28:31
Message-ID: CAM-w4HObtoH7vekEP6W5C-CCie26CDNyAXK8G3vPcVTWxZdGtw@mail.gmail.com
Lists: pgsql-hackers

One thing I keep coming back to is a bad RAM chip setting a bit in the
block number. But I just can't seem to get it to add up. The difference is
not a power of two, it has happened on two different machines, and we don't
see other weirdness on those machines. It seems like a strange coincidence
that it would happen to the same variable twice and not to other variables.
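
The single-bit-flip theory can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the two values to compare are the block number in the WAL record (3634978) and the block the page appears to have been written to (7141472), both taken from the quoted message. The helper names are made up for this example.

```python
# Block numbers as reported in the quoted WAL record discussion.
intended_block = 3634978   # blk in the bkpblock[1] record
observed_block = 7141472   # block the page appears to have been written to


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


xor = intended_block ^ observed_block
diff = observed_block - intended_block

# A single flipped bit would give exactly one differing bit and a
# difference that is a power of two.
print(f"xor  = {xor:#x}, differing bits = {differing_bits(intended_block, observed_block)}")
print(f"diff = {diff}, power of two: {is_power_of_two(diff)}")
```

With these two values the block numbers differ in 12 bit positions and the difference (3506494) is not a power of two, which matches the observation that one flipped bit does not explain it.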

Unless there's some unrelated code writing through a wild pointer, possibly
to a stack-allocated object that just happens to often be that variable?

--
greg
On 31 Jan 2014 20:21, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Greg Stark <stark(at)mit(dot)edu> writes:
> > So just to summarize, this xlog record:
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> > 3634978/282
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> > blk:3634978 hole_off/len:1240/2072
>
> > Appears to have been written to [ block 7141472 ]
>
> I've been staring at the code for a bit trying to guess how that could
> have happened. Since the WAL record has a backup block, btree_xlog_insert
> would have passed control to RestoreBackupBlock, which would call
> XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
> about writing past the end of the relation. Now, you can imagine some
> very low-level error causing a write to go to the wrong page due to a seek
> problem or some such, but it's hard to credit that that would've resulted
> in creation of all the intervening segment files. Some level of our code
> had to have thought it was being told to extend the relation.
>
> However, on closer inspection I was a bit surprised to realize that there
> are two possible candidates for doing that! XLogReadBufferExtended will
> extend the relation, a block at a time, if told to write a page past
> the current nominal EOF. And in md.c, _mdfd_getseg will *also* extend
> the relation if we're InRecovery, even though it normally would not do
> so when called from mdwrite().
>
> Given the behavior in XLogReadBufferExtended, I rather think that the
> InRecovery special case in _mdfd_getseg is dead code and should be
> removed. But for the purpose at hand, it's more interesting to try to
> confirm which of these code levels did the extension. I notice that
> _mdfd_getseg only bothers to write the last physical page of each segment,
> whereas XLogReadBufferExtended knows nothing of segments and will
> ploddingly write every page. So on a filesystem that supports "holes"
> in files, I'd expect that the added segments would be fully allocated
> if XLogReadBufferExtended did the deed, but they'd be quite small if
> _mdfd_getseg did so. The du results you started with suggest that the
> former is the case, but could you verify that the filesystem this is
> on supports holes and that du will report only the actually allocated
> space when there's a hole?
>
> Assuming that the extension was done in XLogReadBufferExtended, we are
> forced to the conclusion that XLogReadBufferExtended was passed a bad
> block number (viz 7141472); and it's pretty hard to see how that could
> happen. RestoreBackupBlock is just passing the value it got out of the
> WAL record. I thought about the idea that it was wrong about exactly
> where the BkpBlock struct was in the record, but that would presumably
> lead to garbage relnode and fork numbers, not just a bad block number.
>
> So I'm still baffled ...
>
> regards, tom lane
>
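
The check Tom asks for above (are the added segment files fully allocated, or mostly holes?) can be sketched roughly as follows. This is only an illustration, not anything from the PostgreSQL source: it assumes the default build settings of 8 kB blocks and 1 GB segment files (131072 blocks per segment), and a Unix-like system where `st_blocks` is counted in 512-byte units, as on Linux. The function names are hypothetical.

```python
import os

BLCKSZ = 8192          # assumed default PostgreSQL block size, in bytes
RELSEG_SIZE = 131072   # assumed default number of blocks per segment (1 GB)


def segment_of(block: int) -> tuple:
    """Return (segment number, block offset within that segment)."""
    return divmod(block, RELSEG_SIZE)


def allocation(path: str) -> tuple:
    """Return (apparent size, allocated bytes) for a file.

    Assumes st_blocks is in 512-byte units, which holds on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def looks_sparse(path: str) -> bool:
    """True if the file occupies less disk space than its apparent size."""
    size, allocated = allocation(path)
    return allocated < size


# Which segment files the two block numbers from the thread fall into.
print(segment_of(3634978))   # block named in the WAL record
print(segment_of(7141472))   # block that appears to have been written
```

Under those assumptions the block in the WAL record falls in segment 27 and the block actually written falls in segment 54, so the segments in between are the ones to inspect. Running `looks_sparse` on each of those files would suggest which code level did the extension: mostly-hole files would point at _mdfd_getseg, fully allocated ones at XLogReadBufferExtended. The result is only meaningful if the filesystem supports holes in the first place.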
