Re: BUG #17212: pg_amcheck fails on checking temporary relations

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: BUG #17212: pg_amcheck fails on checking temporary relations
Date: 2021-10-05 00:32:54
Message-ID: 81019B2A-C0EB-4890-BA3A-DB60590AAF43@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

> On Oct 4, 2021, at 4:45 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> I'm guessing that you meant REINDEX CONCURRENTLY.

Yes.

> Since you're talking about the case where it has an error

Sorry, I realized after hitting <send> that you might take it that way, but I mean the logs generally, not just postgres logs. If somebody runs "reindex concurrently" on all tables at midnight every night, and they see one morning in the (non-postgres) logs from the midnight hour weird error messages about their RAID controller, they may well want to check all their indexes to see if any of them were corrupted. This is a totally made-up example, but the idea that a user might want to check their indexes, tables, or both owing to errors of some kind is not far-fetched.

> , the whole
> index build must have failed. So the user would get exactly what
> they'd expect -- verification of the original index, without any
> hindrance from the new/failed index.

Right, in that case, but not if hardware errors corrupted the index, and generated logs, without happening to trip up the reindex concurrently statement itself.

> (Thinks for a moment...)
>
> Actually, I think that we'd only verify the original index, even
> before the error with CONCURRENTLY (though I've not checked that point
> myself).

To get back on track, let me say that I'm not taking the position that what pg_amcheck currently does is necessarily correct, but just that I'd like to be careful about what we change, if anything. There are three things that seem to irritate people:

1) A non-zero exit code means "not everything was checked and passed" rather than "at least one thing is definitely corrupt".

2) Skipping of indexes is reported to the user with the word 'ERROR' in the report rather than, say, 'NOTICE'.

3) Deadlocks can occur

I have resisted changing #1 on the theory that `pg_amcheck --all && ./post_all_checks_pass.sh` should only run the post_all_checks_pass.sh if indeed all checks have passed, and I'm interpreting skipping an index check as being contrary to that. But maybe that's wrong of me. I don't know. There is already sloppiness between the time that pg_amcheck resolves which database relations are matched by --all, --table, --index, etc. and the time that all the checks are started, and again between that time and when the last one is complete. Database objects could be created or dropped during those spans of time, in which case --all doesn't have quite so well defined a meaning. But the user running pg_amcheck might also *know* that they aren't running any such DDL, and therefore expect --all to actually result in everything being checked.

I find it strange that I should do anything about #2 in pg_amcheck, since it's the function in verify_nbtree that phrases the situation as an error. But I suppose I can just ignore that and have it print as a notice. I'm genuinely not trying to give you grief here -- I simply don't like that pg_amcheck is adding commentary atop what the checking functions are doing. I see a clean division between what pg_amcheck is doing and what amcheck is doing, and this feels to me to put that on the wrong side of the divide. If refusing to check the index because it is not in the requisite state is a notice, then why wouldn't verify_nbtree raise it as one and return early rather than raising an error?

I also find it strange that #3 is being attributed to pg_amcheck's choice of how to call the checking function, because I can't think of any other function where we require the SQL caller to do anything like what you are requiring here in order to prevent deadlocks, and also because the docs for the functions don't say that a deadlock is possible, merely that the function may return with an error. I was totally content to get an error back, since errors are how the verify_nbtree functions communicate everything else, and the handler for those functions is already prepared to deal with the error messages so returned. But it clearly annoys you that pg_amcheck is doing this, so I'll go forward with the patch that I already have written which does otherwise.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Elvis Pranskevichus 2021-10-05 00:36:09 Re: BUG #17213: Wrong result from a query involving Merge Semi Join and Memoize
Previous Message David Rowley 2021-10-05 00:14:47 Re: BUG #17213: Wrong result from a query involving Merge Semi Join and Memoize

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-10-05 00:43:34 Re: plperl on windows
Previous Message Jaime Casanova 2021-10-05 00:30:34 Re: Patch: Range Merge Join