Re: BUG #6425: Bus error in slot_deform_tuple

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>, pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-03 06:45:13
Message-ID: 12379.1328251513@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

I wrote:
> I have not gotten very far with the coredump, except to observe that
> gdb says the Assert ought to have passed: ...
> This suggests very strongly that indeed the buffer was changing under
> us.

I probably ought to let the test case run overnight before concluding
anything, but at this point it's run for two-plus hours with no errors
after applying this patch:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cce87a3..b128bfd 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
*** 3716,3724 ****
}
else
{
- /* must zero-fill the hole */
- MemSet((char *) page, 0, BLCKSZ);
memcpy((char *) page, blk, bkpb.hole_offset);
memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
blk + bkpb.hole_offset,
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
--- 3716,3724 ----
}
else
{
memcpy((char *) page, blk, bkpb.hole_offset);
+ /* must zero-fill the hole */
+ MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
blk + bkpb.hole_offset,
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));

The existing code makes the page state transiently invalid (all zeroes)
for no particularly good reason, and consumes useless cycles to do so,
so this would be a good change in any case. The reason it is relevant
to our current problem is that even though RestoreBkpBlocks faithfully
takes exclusive lock on the buffer, *that is not enough to guarantee
that no one else is touching that buffer*. Another backend that has
already located a visible tuple on a page is entitled to keep accessing
that tuple with only a buffer pin. So the existing code transiently
wipes the data from underneath the other backend's pin.

It's clear how this explains the symptoms I saw (Assert reporting wrong
value of lp_flags even though the backend must previously have seen the
right value, and the eventual coredump captured the right value too).
It's less clear though whether this explains the symptoms seen by Duncan
and Bridget. They presumably are running without asserts enabled, so
it's unsurprising that they don't see the Assert failure, but what
happens if control gets past that? There seem to be several possible
failure modes:

* Reader picks up zero lp_off/lp_len from the line pointer, and then
tries to interpret the page header as a tuple. The results would be
predictable only until RestoreBkpBlocks puts back nonzero data there,
and then it's a bit of a mess. (In particular, t_hoff would be read out
of the pd_prune_xid field if I counted right, and so would have a rather
unpredictable value.)

* Reader finds correct location of tuple, but sees t_hoff and/or
t_infomask as zeroes (the latter possibly causing it to not check for
nulls, if it doesn't think HEAP_HASNULL is set). Until RestoreBkpBlocks
puts back the data, this would devolve to the next case, but after that
it's a bit unpredictable again.

* Reader finds correct location of data, but sees zeroes there.

I believe that the reported failures involving palloc(-3) in
text_to_cstring can be explained as instances of seeing zeroes where a
text or varchar value is expected. Zeroes would look like a long-format
varlena header with value zero, and the code would subtract 4 to get the
data length, then add 1 for the trailing NUL byte that's needed in
cstring representation, and thus ask for -3 bytes for the cstring
equivalent. Furthermore, all three of the above cases end up doing that
as long as the page stays all-zero for long enough. If some nonzero
data gets restored before we're done examining the page, you get some
other behavior of doubtful predictability. Maybe the other reported
symptoms always fall out of that, or maybe not --- it seems surprising
that we don't have a wider set of visible misbehaviors.

Whether or not this single bug explains all the cases reported so far,
it certainly seems possible that there are other mistakes of the same
sort in individual WAL replay routines. I think we'd better go over
all of them with a fine-tooth comb. In general, a WAL replay routine
can no longer be allowed to create transiently invalid page states
that would not have occurred in the "live" version of the page change.

I am even more troubled than I was before about what this says about
the amount of testing Hot Standby has gotten, because AFAICS absolutely
any use of Hot Standby, no matter the particulars, ought to be heavily
exposed to this bug.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Bridget Frey 2012-02-03 07:27:38 Re: BUG #6425: Bus error in slot_deform_tuple
Previous Message Bruce Momjian 2012-02-03 01:48:21 Re: Re: [PATCH] Use CC atomic builtins if available [was: Re: TAS patch for building on armel/armhf thumb]

Browse pgsql-hackers by date

  From Date Subject
Next Message Bridget Frey 2012-02-03 07:27:38 Re: BUG #6425: Bus error in slot_deform_tuple
Previous Message Fujii Masao 2012-02-03 05:22:35 Re: Hot standby fails if any backend crashes