Re: HOT chain validation in verify_heapam()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Himanshu Upadhyaya <upadhyaya(dot)himanshu(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: HOT chain validation in verify_heapam()
Date: 2023-03-23 17:26:07
Message-ID: 20230323172607.y3lejpntjnuis5vv@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-03-23 11:41:52 -0400, Robert Haas wrote:
> On Wed, Mar 22, 2023 at 5:56 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I also think it's not quite right that some of checks inside if
> > (ItemIdIsRedirected()) continue in case of corruption, others don't. While
> > there's a later continue, that means the corrupt tuples get added to the
> > predecessor array. Similarly, in the non-redirect portion, the successor
> > array gets filled with corrupt tuples, which doesn't seem quite right to me.
>
> I'm not entirely sure I agree with this. I mean, we should skip
> further checks of the data is so corrupt that further checks are not
> sensible e.g. if the line pointer is just gibberish we can't validate
> anything about the tuple, because we can't even find the tuple.
> However, we don't want to skip further checks as soon as we see any
> kind of a problem whatsoever. For instance, even if the tuple data is
> utter gibberish, that should not and does not keep us from checking
> whether the update chain looks sane. If the tuple header is garbage
> (e.g. xmin and xmax are outside the bounds of clog) then at least some
> of the update-chain checks are not possible, because we can't know the
> commit status of the tuples, but garbage in the tuple data isn't
> really a problem for update chain validation per se.

The cases I was complaining about where metadata that's important. We
shouldn't enter a tuple into lp_valid[] or successor[] if it failed validity
checks - the subsequent error reports that that generates won't be helpful -
or we'll crash.

E.g. continuing after:

rditem = PageGetItemId(ctx.page, rdoffnum);
if (!ItemIdIsUsed(rditem))
report_corruption(&ctx,
psprintf("line pointer redirection to unused item at offset %u",
(unsigned) rdoffnum));

means we'll look into the tuple in the "update chain validation" loop for
unused items. Where it afaict will lead to a crash or bogus results, because:
/* Can only redirect to a HOT tuple. */
next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
if (!HeapTupleHeaderIsHeapOnly(next_htup))
{
report_corruption(&ctx,
psprintf("redirected line pointer points to a non-heap-only tuple at offset %u",
(unsigned) nextoffnum));
}

will just dereference the unused item.

> - * Redirects are created by updates, so successor should be
> - * the result of an update.
> + * Redirects are created by HOT updates, so successor should
> + * be the result of an HOT update.
> + *
> + * XXX: HeapTupleHeaderIsHeapOnly() should always imply
> + * HEAP_UPDATED. This should be checked even when the tuple
> + * isn't a target of a redirect.
>
> Hmm, OK. So the question is where to put this check. Maybe inside
> check_tuple_header(), making it independent of the update chain
> validation stuff?

Yes, check_tuple_header sounds sensible to me.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-03-23 17:33:38 Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)
Previous Message Bharath Rupireddy 2023-03-23 17:24:40 Re: Add pg_walinspect function with block info columns