Re: new heapcheck contrib module

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2021-02-01 00:05:15
Message-ID: 70A7BABE-EDC1-4217-8FEB-F446F2938CE1@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

Attached is patch set 35. Per your review comments, I have restructured the patches in the following way:

v33's 0007 is now the first patch, v35's 0001

v33's 0001 is no more. The frontend infrastructure for error handling and exiting may be resubmitted someday in another patch, but they aren't necessary for pg_amcheck

v33's 0002 is no more. The PGresultHandler stuff that it defined inspires some of what comes later in v35's 0003, but it isn't sufficiently similar to what v35 does to be thought of as moving from v33-0002 into v35-0003.

v33's 0003, 0004 and 0006 are combined into v35's 0002

v33's 0005 becomes v35's 0003

v33's 0007 becomes v35's 0004

Additionally, pg_amcheck testing is extended beyond what v33 had in v35's new 0005 patch, but pg_amcheck doesn't depend on this new 0005 patch ever being committed, so if you don't like it, just throw it in the bit bucket.

>
> I like 0007 quite a bit and am inclined to commit it soon, as it
> doesn't depend on the earlier patches. But:
>
> - I think the residual comment in processSQLNamePattern beginning with
> "Note:" could use some wordsmithing to account for the new structure
> of things -- maybe just "this pass" -> "this function".
> - I suggest changing initializations like maxbuf = buf + 2 to maxbuf =
> &buf[2] for clarity

Already responded to this in the v34 development a few days ago. Nothing meaningfully changes between 34 and 35.

> Regarding 0001:
>
> - My preference would be to dump on_exit_nicely_final() and just rely
> on order of registration.
> - I'm not entirely sure it's a good ideal to expose something named
> fatal() like this, because that's a fairly short and general name. On
> the other hand, it's pretty descriptive and it's not clear why someone
> including exit_utils.h would want any other definitiion. I guess we
> can always change it later if it proves to be problematic; it's got a
> lot of callers and I guess there's no point in churning the code
> without a clear reason.
> - I don't quite see why we need this at all. Like, exit_nicely() is a
> pg_dump-ism. It would make sense to centralize it if we were going to
> use it for pg_amcheck, but you don't. If you were going to, you'd need
> to adapt 0003 to use exit_nicely() instead of exit(), but you don't,
> nor do you add any other new calls to exit_nicely() anywhere, except
> for one in 0002. That makes the PGresultHandler stuff depend on
> exit_nicely(), which might be important if you were going to refactor
> pg_dump to use that abstraction, but you don't. I'm not opposed to the
> idea of centralized exit processing for frontend utilities; indeed, it
> seems like a good idea. But this doesn't seem to get us there. AFAICS
> it just entangles pg_dump with pg_amcheck unnecessarily in a way that
> doesn't really benefit either of them.

Removed from v35.

> Regarding 0002:
>
> - I don't think this is separately committable because it adds an
> abstraction but not any uses of that abstraction to demonstrate that
> it's actually any good. Perhaps it should just be merged into 0005,
> and even into parallel_slot.h vs. having its own header. I'm not
> really sure about that, though

Yeah, this is gone from v35, with hints of it moved into 0003 as part of the parallel slots refactoring.

> - Is this really much of an abstraction layer? Like, how generic can
> this be when the argument list includes ExecStatusType expected_status
> and int expected_ntups?

The new format takes a void *context argument.

> - The logic seems to be very similar to some of the stuff that you
> move around in 0003, like executeQuery() and executeCommand(), but it
> doesn't get unified. I'm not necessarily saying it should be, but it's
> weird to do all this refactoring and end up with something that still
> looks this

Yeah, I agree with this. The refactoring is a lot less ambitious in v35, to avoid these issues.

> 0003, 0004, and 0006 look pretty boring; they are just moving code
> around. Is there any point in splitting the code from 0003 across two
> files? Maybe it's fine.

Combined.

> If I run pg_amcheck --all -j4 do I get a serialization boundary across
> databases? Like, I have to completely finish db1 before I can go onto
> db2, even though maybe only one worker is still busy with it?

The command line interface and corresponding semantics for specifying which tables to check, which schemas to check, and which databases to check should be the same as that for reindexdb and vacuumdb, and the behavior for handing off those targets to be checked/reindexed/vacuumed through the parallel slots interface should be the same. It seems a bit much to refactor reindexdb and vacuumdb to match pg_amcheck when pg_amcheck hasn't been accepted for commit as yet. If/when that happens, and if the project generally approves of going in this direction, I think the next step will be to refactor some of this logic out of pg_amcheck into fe_utils and use it from all three utilities. At that time, I'd like to tackle the serialization choke point in all three, and handle it in the same way for them all.

For the new v35-0005 patch, I have extended PostgresNode.pm with some new corruption abilities. In short, it can now take a snapshot of the files that back a relation, and can corruptly rollback those files to prior versions, in full or in part. This allows creating kinds of corruption that are hard to create through mere bit twiddling. For example, if the relation backing an index is rolled back to a prior version, amcheck's btree checking sees the index as not corrupt, but when asked to reconcile the entries in the heap with the index, it can see that not all of them are present. This gives test coverage of corruption checking functionality that is otherwise hard to achieve.

To check that the PostgresNode.pm changes themselves work, v35-0005 adds src/test/modules/corruption

To check pg_amcheck, and by implication amcheck, v35-0005 adds contrib/pg_amcheck/t/006_relfile_damage.pl

Once again, v35-0005 does not need to be committed -- pg_amcheck works just fine without it.

You and I have discussed this off-list, but for the record, amcheck and pg_amcheck currently only check heaps and btree indexes. Other object types, such as sequences and non-btree indexes, are not checked. Some basic sanity checking of other object types would be a good addition, and pg_amcheck has been structured in a way where it should be fairly straightforward to add support for those. The only such sanity checking that I thought could be done in a short timeframe was to check that the relation files backing the objects were not missing, and we decided off-list such checking wasn't worth much, so I didn't add it.

Attachment Content-Type Size
v35-0001-Refactoring-processSQLNamePattern.patch application/octet-stream 10.7 KB
v35-0002-Moving-code-from-src-bin-scripts-to-fe_utils.patch application/octet-stream 35.3 KB
v35-0003-Parameterizing-parallel-slot-result-handling.patch application/octet-stream 11.3 KB
v35-0004-Adding-contrib-module-pg_amcheck.patch application/octet-stream 130.6 KB
v35-0005-Extending-PostgresNode-to-test-corruption.patch application/octet-stream 16.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2021-02-01 00:23:04 Re: row filtering for logical replication
Previous Message David Rowley 2021-01-31 23:49:22 Re: [sqlsmith] Failed assertion during partition pruning