Re: pg_amcheck contrib application

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-03-16 02:10:37
Message-ID: AA5506CE-7D2A-42E4-A51D-358635E3722D@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mar 15, 2021, at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Not sure that I believe the theory that this is from bad luck of
> concurrent autovacuum timing, though. The fact that we're seeing
> this on just those two animals suggests strongly to me that it's
> architecture-correlated, instead.

I find it a little hard to see how amcheck is tripping over a toasted value just in this one table, pg_statistic, and not in any of the others. The error message says the problem is in attribute 27, which I believe means it is stavalues2. The comment in the header for this catalog is intriguing:

/*
* Values in these arrays are values of the column's data type, or of some
* related type such as an array element type. We presently have to cheat
* quite a bit to allow polymorphic arrays of this kind, but perhaps
* someday it'll be a less bogus facility.
*/
anyarray stavalues1;
anyarray stavalues2;
anyarray stavalues3;
anyarray stavalues4;
anyarray stavalues5;

This is hard to duplicate in a test, because you can't normally create tables with pseudo-type columns. However, if amcheck is walking the tuple and does not correctly update the offset with the length of attribute 26, it may try to read attribute 27 at the wrong offset, unsurprisingly leading to garbage, perhaps a garbage toast pointer. The attached patch v7-0004 adds a check to verify_heapam to see if the va_toastrelid matches the expected toast table oid for the table we're reading. That check almost certainly should have been included in the initial version of verify_heapam, so even if it does nothing to help us in this issue, it's good that it be committed, I think.

It is unfortunate that the failing test only runs pg_amcheck after creating numerous corruptions, as we can't know if pg_amcheck would have complained about pg_statistic before the corruptions were created in other tables, or if it only does so after. The attached patch v7-0003 adds a call to pg_amcheck after all tables are created and populated, but before any corruptions are caused. This should help narrow down what is happening, and doesn't hurt to leave in place long-term.

I don't immediately see anything wrong with how pg_statistic uses a pseudo-type, but it leads me to want to poke a bit more at pg_statistic on hornet and tern, though I don't have any regression tests specifically for doing so.

Tests v7-0001 and v7-0002 are just repeats of the tests posted previously.

Attachment Content-Type Size
v7-0001-Turning-off-autovacuum-during-corruption-tests.patch application/octet-stream 1.6 KB
v7-0002-Fixing-a-confusing-amcheck-corruption-message.patch application/octet-stream 2.5 KB
v7-0003-Extend-pg_amcheck-test-suite.patch application/octet-stream 2.4 KB
v7-0004-Add-extra-check-of-toast-pointer-in-amcheck.patch application/octet-stream 1.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-03-16 02:55:41 Re: shared-memory based stats collector
Previous Message Andres Freund 2021-03-16 02:04:29 Re: shared-memory based stats collector