Re: pg_amcheck contrib application

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-03-18 04:00:39
Message-ID: 4F10540A-9022-49EE-A6FC-7FE90797793E@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mar 16, 2021, at 12:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Mar 15, 2021 at 10:10 PM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> It is unfortunate that the failing test only runs pg_amcheck after creating numerous corruptions, as we can't know if pg_amcheck would have complained about pg_statistic before the corruptions were created in other tables, or if it only does so after. The attached patch v7-0003 adds a call to pg_amcheck after all tables are created and populated, but before any corruptions are caused. This should help narrow down what is happening, and doesn't hurt to leave in place long-term.
>>
>> I don't immediately see anything wrong with how pg_statistic uses a pseudo-type, but it leads me to want to poke a bit more at pg_statistic on hornet and tern, though I don't have any regression tests specifically for doing so.
>>
>> Tests v7-0001 and v7-0002 are just repeats of the tests posted previously.
>
> Since we now know that shutting autovacuum off makes the problem go
> away, I don't see a reason to commit 0001. We should fix pg_amcheck
> instead, if, as presently seems to be the case, that's where the
> problem is.

If you get unlucky, autovacuum will process one of the tables that the test intentionally corrupted, with bad consequences, ultimately causing build farm intermittent test failures. We could wait to see if this ever happens before fixing it, if you prefer. I'm not sure what that buys us, though.

The right approach, I think, is to extend the contrib/amcheck tests to include regressing this particular case to see if it fails. I've written that test and verified that it fails against the old code and passes against the new.

> I just committed 0002.

Thanks!

> I think 0003 is probably a good idea, but I haven't committed it yet.

It won't do anything for us in this particular case, but it might make debugging failures quicker in the future.

> As for 0004, it seems to me that we might want to do a little more
> rewording of these messages and perhaps we should try to do it all at
> once. Like, for example, your other change to print out the toast
> value ID seems like a good idea, and could apply to any new messages
> as well as some existing ones. Maybe there are also more fields in the
> TOAST pointer for which we could add checks.

Of the toast pointer fields:

int32 va_rawsize; /* Original data size (includes header) */
int32 va_extsize; /* External saved size (doesn't) */
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */

all seem worth getting as part of any toast error message, even if these fields themselves are not corrupt. It just makes it easier to understand the context of the error you're looking at. At first I tried putting these into each message, but it is very wordy to say things like "toast pointer with rawsize %u and extsize %u pointing at relation with oid %u" and such. It made more sense to just add these four fields to the verify_heapam tuple format. That saves putting them in the message text itself, and has the benefit that you could filter the rows coming from verify_heapam() for ones where valueid is or is not null, for example. This changes the external interface of verify_heapam, but I didn't bother with a amcheck--1.3--1.4.sql because amcheck--1.2--1.3. sql was added as part of the v14 development work and has not yet been released. My assumption is that I can just change it, rather than making a new upgrade file.

These patches fix the visibility rules and add extra toast checking. Some of the previous patch material is not included, since it is not clear to me if you wanted to commit any of it.

Attachment Content-Type Size
v9-0001-Fixing-amcheck-tuple-visibility-rules.patch application/octet-stream 5.6 KB
v9-0002-pg_amcheck-provide-additional-toast-corruption-in.patch application/octet-stream 32.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2021-03-18 04:09:38 Re: pl/pgsql feature request: shorthand for argument and local variable references
Previous Message Michael Paquier 2021-03-18 03:56:12 Re: PITR promote bug: Checkpointer writes to older timeline