Re: pg_amcheck contrib application

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-03-11 04:02:00
Message-ID: E8A89327-C1B1-4AE2-AEC6-871D39843627@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mar 10, 2021, at 12:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> Once again, I think you are right and have removed the objectionable behavior, but....
>>
>> The --startblock and --endblock options make the most sense when the user is only checking one table, like
>>
>> pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table
>>
>> because the user likely has some knowledge about that table, perhaps from a prior run of pg_amcheck. The --startblock and --endblock arguments are a bit strange when used globally, as relations don't all have the same number of blocks, so
>>
>> pg_amcheck --startblock=17 --endblock=19 mydb
>>
>> will very likely emit lots of error messages for tables which don't have blocks in that range. That's not entirely pg_amcheck's fault, as it just did what the user asked, but it also doesn't seem super helpful. I'm not going to do anything about it in this release.
>
> +1 to all that. I tend toward the opinion that trying to make
> --startblock and --endblock do anything useful in the context of
> checking multiple relations is not really possible, and therefore we
> just shouldn't put any effort into it. But if user feedback shows
> otherwise, we can always do something about it later.
>
>> After running 'make installcheck', if I delete all entries from pg_class where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck regression', I get lines that look like this:
>>
>> heap relation "regression"."public"."quad_poly_tbl":
>> ERROR: could not open relation with OID 17177
>
> In this here example, the first line ends in a colon.
>
>> relation "regression"."public"."functional_dependencies", block 28, offset 54, attribute 0
>> attribute 0 with length 4294967295 ends at offset 50 beyond total tuple length 43
>
> But this here one does not. Seems like it should be consistent.

Good point. It also seems inconsistent that in one it refers to a "relation" and in the other to a "heap relation", but they're both heap relations. Changed to use "heap relation" both places, and to both use colons.

>
> The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere,
> which is good, because macros with unbalanced parentheses are usually
> not a good plan; and a macro that expands to a comma-separated list of
> things is suspect too.

Yeah, that whole macro was supposed to be removed. Looks like I somehow only removed the end of it, plus some functions that were using it. Not sure how I fat fingered that in the editor, but I've removed the rest now.

> "invalid skip options\n" seems too plural.

Changed to something less plural.

> With regard to your use of strtol() for --{start,end}block, telling
> the user that their input is garbage seems pejorative, even though it
> may be accurate. Compare:
>
> [rhaas EDBAS]$ pg_dump -jdsgdsgd
> pg_dump: error: invalid number of parallel jobs
>
> In the message "relation end block argument precedes start block
> argument\n", I think you could lose both instances of the word
> "argument" and probably the word "relation" as well. I actually don't
> know why all of these messages about start and end block mention
> "relation". It's not like there is some other kind of
> non-relation-related start block with which it could be confused.

Changed.

> The comment for run_command() explains some things about the cparams
> argument, but those things are false. In fact the argument is unused.

Removed unused argument and associated comment.

> Usual PostgreSQL practice when freeing memory in e.g.
> verify_heap_slot_handler is to set the pointers to NULL as well. The
> performance cost of this is trivial, and it makes debugging a lot
> easier should somebody accidentally write code to access one of those
> things after it's been freed.

I had been doing that and removed it, anticipating a complaint about useless code. Ok, I put it back.

> The documentation says that -D "does exclude any database that was
> listed explicitly as dbname on the command line, nor does it exclude
> the database chosen in the absence of any dbname argument." The first
> part of this makes complete sense to me, but I'm not sure about the
> second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting
> that "rhaas" won't be checked. Likewise, if I say pg_amcheck -d
> 'bob*', I think I only want to check the bob-related databases and not
> rhaas.

I think it's a tricky definitional problem. I'll argue the other side for the moment:

If you say `pg_amcheck bob`, I think it is fair to assume that "bob" gets checked. If you say `pg_amcheck bob -d="b*" -D="bo*"`, it is fair to expect all databases starting with /b/ to be checked, except those starting with /bo/, except that since you *explicitly* asked for "bob", that "bob" gets checked. We both agree on this point, I think.

If you say `pg_amcheck --maintenance-db=bob -d="b*" -D="bo*", you don't expect "bob" to get checked, even though it was explicitly stated.

If you are named "bob", and run `pg_amcheck`, you expect it to get your name "bob" from the environment, and check database "bob". It's implicit rather than explicit, but that doesn't change what you expect to happen. It's just a short-hand for saying `pg_amcheck bob`.

Saying that `pg_amcheck -d="b*" -D="bo*" should not check "bob" implies that the database being retrieved from the environment is acting like a maintenance-db. But that's not how it is treated when you just say `pg_amcheck` with no arguments. I think treating it as a maintenance-db in some situations but not in others is strangely non-orthogonal.

On the other hand, I would expect some users to come back with precisely your complaint, so I don't know how best to solve this.

> I suggest documenting --endblock as "Check table blocks up to and
> including the specified ending block number. An error will occur if a
> relation being checked has fewer than this number of blocks." And
> similarly for --startblock: "Check table blocks beginning with the
> specified block number. An error will occur, etc." Perhaps even
> mention something like "This option is probably only useful when
> checking a single table." Also, the documentation here isn't clear
> that this affects only table checking, not index checking.

Changed.

> It appears that pg_amcheck sometimes makes dummy connections to the
> database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in:
>
> 2021-03-10 15:00:14.273 EST [95473] LOG: connection received: host=[local]
> 2021-03-10 15:00:14.274 EST [95473] LOG: connection authorized:
> user=rhaas database=rhaas application_name=pg_amcheck
> 2021-03-10 15:00:14.286 EST [95473] LOG: statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> 2021-03-10 15:00:14.290 EST [95464] DEBUG: forked new backend,
> pid=95474 socket=11
> 2021-03-10 15:00:14.291 EST [95464] DEBUG: server process (PID 95473)
> exited with exit code 0
> 2021-03-10 15:00:14.291 EST [95474] LOG: connection received: host=[local]
> 2021-03-10 15:00:14.293 EST [95474] LOG: connection authorized:
> user=rhaas database=rhaas application_name=pg_amcheck
> 2021-03-10 15:00:14.296 EST [95474] LOG: statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> <...more queries from PID 95474...>
> 2021-03-10 15:00:14.321 EST [95464] DEBUG: server process (PID 95474)
> exited with exit code 0
>
> It doesn't seem to make sense to connect to a database, set the search
> path, exit, and then immediately reconnect to the same database.

I think I've cleaned that up now.

> This is slightly inconsistent:
>
> pg_amcheck: checking heap table "rhaas"."public"."foo"
> heap relation "rhaas"."public"."foo":
> ERROR: XX000: catalog is missing 144 attribute(s) for relid 16392
> LOCATION: RelationBuildTupleDesc, relcache.c:652
> query was: SELECT blkno, offnum, attnum, msg FROM "public".verify_heapam(
> relation := 16392, on_error_stop := false, check_toast := true, skip := 'none')
>
> In line 1 it's a heap table, but in line 2 it's a heap relation.

Changed to use "heap table" consistently, and along those lines, to use "btree index" rather than "btree relation".

Attachment Content-Type Size
v45-0001-Reworking-ParallelSlots-for-mutliple-DB-use.patch application/octet-stream 24.0 KB
v45-0002-Adding-contrib-module-pg_amcheck.patch application/octet-stream 144.3 KB
v45-0003-Extending-PostgresNode-to-test-corruption.patch application/octet-stream 17.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2021-03-11 04:12:57 RE: libpq debug log
Previous Message Amit Kapila 2021-03-11 03:58:45 Re: [PATCH] Provide more information to filter_prepare