Re: pg_amcheck contrib application

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-03-10 20:28:39
Message-ID: CA+TgmobwDwvqbpc0OF8MNzQuwAJ2X1D0qEMdBEXXbmmFfzP9+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> Once again, I think you are right and have removed the objectionable behavior, but....
>
> The --startblock and --endblock options make the most sense when the user is only checking one table, like
>
> pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table
>
> because the user likely has some knowledge about that table, perhaps from a prior run of pg_amcheck. The --startblock and --endblock arguments are a bit strange when used globally, as relations don't all have the same number of blocks, so
>
> pg_amcheck --startblock=17 --endblock=19 mydb
>
> will very likely emit lots of error messages for tables which don't have blocks in that range. That's not entirely pg_amcheck's fault, as it just did what the user asked, but it also doesn't seem super helpful. I'm not going to do anything about it in this release.

+1 to all that. I tend toward the opinion that trying to make
--startblock and --endblock do anything useful in the context of
checking multiple relations is not really possible, and therefore we
just shouldn't put any effort into it. But if user feedback shows
otherwise, we can always do something about it later.

> After running 'make installcheck', if I delete all entries from pg_class where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck regression', I get lines that look like this:
>
> heap relation "regression"."public"."quad_poly_tbl":
> ERROR: could not open relation with OID 17177

In this here example, the first line ends in a colon.

> relation "regression"."public"."functional_dependencies", block 28, offset 54, attribute 0
> attribute 0 with length 4294967295 ends at offset 50 beyond total tuple length 43

But this here one does not. Seems like it should be consistent.

The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere,
which is good, because macros with unbalanced parentheses are usually
not a good plan; and a macro that expands to a comma-separated list of
things is suspect too.

"invalid skip options\n" seems too plural.

With regard to your use of strtol() for --{start,end}block, telling
the user that their input is garbage seems pejorative, even though it
may be accurate. Compare:

[rhaas EDBAS]$ pg_dump -jdsgdsgd
pg_dump: error: invalid number of parallel jobs

In the message "relation end block argument precedes start block
argument\n", I think you could lose both instances of the word
"argument" and probably the word "relation" as well. I actually don't
know why all of these messages about start and end block mention
"relation". It's not like there is some other kind of
non-relation-related start block with which it could be confused.

The comment for run_command() explains some things about the cparams
argument, but those things are false. In fact the argument is unused.

Usual PostgreSQL practice when freeing memory in e.g.
verify_heap_slot_handler is to set the pointers to NULL as well. The
performance cost of this is trivial, and it makes debugging a lot
easier should somebody accidentally write code to access one of those
things after it's been freed.

The documentation says that -D "does exclude any database that was
listed explicitly as dbname on the command line, nor does it exclude
the database chosen in the absence of any dbname argument." The first
part of this makes complete sense to me, but I'm not sure about the
second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting
that "rhaas" won't be checked. Likewise, if I say pg_amcheck -d
'bob*', I think I only want to check the bob-related databases and not
rhaas.

I suggest documenting --endblock as "Check table blocks up to and
including the specified ending block number. An error will occur if a
relation being checked has fewer than this number of blocks." And
similarly for --startblock: "Check table blocks beginning with the
specified block number. An error will occur, etc." Perhaps even
mention something like "This option is probably only useful when
checking a single table." Also, the documentation here isn't clear
that this affects only table checking, not index checking.

It appears that pg_amcheck sometimes makes dummy connections to the
database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in:

2021-03-10 15:00:14.273 EST [95473] LOG: connection received: host=[local]
2021-03-10 15:00:14.274 EST [95473] LOG: connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.286 EST [95473] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2021-03-10 15:00:14.290 EST [95464] DEBUG: forked new backend,
pid=95474 socket=11
2021-03-10 15:00:14.291 EST [95464] DEBUG: server process (PID 95473)
exited with exit code 0
2021-03-10 15:00:14.291 EST [95474] LOG: connection received: host=[local]
2021-03-10 15:00:14.293 EST [95474] LOG: connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.296 EST [95474] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
<...more queries from PID 95474...>
2021-03-10 15:00:14.321 EST [95464] DEBUG: server process (PID 95474)
exited with exit code 0

It doesn't seem to make sense to connect to a database, set the search
path, exit, and then immediately reconnect to the same database.

This is slightly inconsistent:

pg_amcheck: checking heap table "rhaas"."public"."foo"
heap relation "rhaas"."public"."foo":
ERROR: XX000: catalog is missing 144 attribute(s) for relid 16392
LOCATION: RelationBuildTupleDesc, relcache.c:652
query was: SELECT blkno, offnum, attnum, msg FROM "public".verify_heapam(
relation := 16392, on_error_stop := false, check_toast := true, skip := 'none')

In line 1 it's a heap table, but in line 2 it's a heap relation.

That's all I've got.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-03-10 20:50:48 Re: [HACKERS] Custom compression methods
Previous Message Tom Lane 2021-03-10 20:25:02 Re: libpq debug log