Re: amcheck assert failure

From: Grigory Smolkin <g(dot)smolkin(at)postgrespro(dot)ru>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: amcheck assert failure
Date: 2019-04-24 08:50:02
Message-ID: 39998e3e-816b-f9ed-2a69-6ac7f05dea4a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 04/22/2019 11:31 PM, Peter Geoghegan wrote:
> On Mon, Apr 22, 2019 at 12:51 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Nonetheless ... do we really want an assertion failure rather than
>> some more-mundane error report? Seems like the point of amcheck
>> is to detect data corruption, so I'd rather get an ERROR than an
>> assertion (which would not fire in production builds anyway).
> (Apologies for sending this a second time, Tom -- somehow failed to CC
> the list the first time)
>
> I think that the size cross-check must have failed to catch the problem:
>
> /*
> * lp_len should match the IndexTuple reported length exactly, since
> * lp_len is completely redundant in indexes, and both sources of
> * tuple length are MAXALIGN()'d. nbtree does not use lp_len all that
> * frequently, and is surprisingly tolerant of corrupt lp_len fields.
> */
> if (tupsize != ItemIdGetLength(itemid))
> ereport(ERROR,
> (errcode(ERRCODE_INDEX_CORRUPTED),
> errmsg("index tuple size does not equal lp_len in
> index \"%s\"",
> RelationGetRelationName(state->rel)),
> errdetail_internal("Index tid=(%u,%u) tuple
> size=%zu lp_len=%u page lsn=%X/%X.",
> state->targetblock, offset,
> tupsize, ItemIdGetLength(itemid),
> (uint32) (state->targetlsn >> 32),
> (uint32) state->targetlsn),
> errhint("This could be a torn page problem.")));
>
> It seems like the length from tupsize (IndexTupleSize()) happened to
> match the redundant ItemIdGetLength()-reported size from the line
> pointer. No idea how that could actually be possible, since the value
> returned by ItemIdGetLength() would be 12592. What are the chances of
> that accidentally matching what the wild itup pointer reported as its
> IndexTupleSize()?
>
> The only explanation I can think of is that the assertion failure was
> from the core code -- _bt_check_natts() is itself called within core
> code assertions, so it's just about possible that that was how this
> happened, before amcheck was even called. Doesn't seem like a very
> good explanation. Do you think that that's possible, Grigory?

No, it was definitely amcheck:
https://pastebin.postgrespro.ru/?2c8ddf5254b2c625#PTbr+jbxFPu2wfy5SFDWdiqaqPN/N5WX8t6mo5rDpmo=

>
> In any case, you're clearly right that amcheck could be more defensive
> here. My pg_hexedit tool detected that there was an LP_UNUSED item
> pointer with storage (i.e. whose lp_len > 0), which is a check that
> amcheck can and should perform, but doesn't. Especially because it
> would prevent a deference of a likely-wild pointer (IndexTuple).
>
> We could:
>
> * Throw an error when any line item reports lp_len == 0.
>
> * Throw an error when there is a line item that's either LP_UNUSED, or
> LP_REDIRECT. The corrupt page sent by Grigory had both.
>
> What do you think of the idea of committing some simple checks to do
> this with contrib/amcheck on v12?
>

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Jan Skydánek 2019-04-24 11:35:03 bug: evil autoConcat when each string is on new line
Previous Message Amit Langote 2019-04-24 01:30:55 Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table