Re: pg_amcheck contrib application

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-04-20 00:07:58
Message-ID: D443E2DC-B8AD-48A4-99D2-77CEDCF84AA0@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Apr 19, 2021, at 12:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Apr 15, 2021 at 1:07 PM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> I have added the verb "has" rather than "contains" because "has" is more consistent with the phrasing of other similar corruption reports.
>
> That makes sense.
>
> I think it's odd that a range of extraneous chunks is collapsed into a
> single report if the size of each chunk happens to be
> TOAST_MAX_CHUNK_SIZE and not otherwise. Why not just remember the
> first and last extraneous chunk and the size of each? If the next
> chunk you see is the next one in sequence and the same size as all the
> others, extend your notion of the sequence end by 1. Otherwise, report
> the range accumulated so far. It seems to me that this wouldn't be any
> more code than you have now, and might actually be less.

In all cases of uncorrupted toasted attributes, the sequence of N chunks that make up the attribute should be N-1 chunks of TOAST_MAX_CHUNK_SIZE ending with a single chunk of up to TOAST_MAX_CHUNK_SIZE. I'd like to refer to such sequences as "reasonably sized" sequences to make conversation easier.

If the toast pointer's va_extsize field leads us to believe that we should find 10 reasonably sized chunks, but instead we find 30 reasonably sized chunks, we know something is corrupt. We shouldn't automatically prejudice the user against the additional 20 chunks. We didn't expect them, but maybe that's because va_extsize was corrupt and gave us a false expectation. We're not pointing fingers one way or the other.

On the other hand, if we expect 10 chunks and find an additional 20 unreasonably sized chunks, we can and should point fingers at the extra 20 chunks. Even if we somehow knew that va_extsize was also corrupt, we'd still be justified in saying the 20 unreasonably sized chunks are each, individually corrupt.

I tried to write the code to report one corruption message per corruption found. There are some edge cases where this is a definitional challenge, so it's not easy to say that I've always achieved this goal, but I think i've done so where the definitions are clear. As such, the only time I'd want to combine toast chunks into a single corruption message is when they are not in themselves necessarily *individually* corrupt. That is why I wrote the code to use TOAST_MAX_CHUNK_SIZE rather than just storing up any series of equally sized chunks.

On a related note, when complaining about a sequence of toast chunks, often the sequence is something like [maximal, maximal, ..., maximal, partial], but sometimes it's just [maximal...maximal], sometimes just [maximal], and sometimes just [partial]. If I'm complaining about that entire sequence, I'd really like to do so in just one message, otherwise it looks like separate complaints.

I can certainly change the code to be how you are asking, but I'd first like to know that you really understood what I was doing here and why the reports read the way they do.

> I think that report_missing_chunks() could probably just report the
> range of missing chunks and not bother reporting how big they were
> expected to be. But, if it is going to report how big they were
> expected to be, I think it should have only 2 cases rather than 4:
> either a range of missing chunks of equal size, or a single missing
> chunk of some size. If, as I propose, it doesn't report the expected
> size, then you still have just 2 cases: a range of missing chunks, or
> a single missing chunk.

Right, this is the same as above. I'm trying not to split a single corruption complaint into separate reports.

> Somehow I have a hard time feeling confident that check_toast_tuple()
> is going to do the right thing. The logic is really complex and hard
> to understand. 'chunkno' is a counter that advances every time we move
> to the next chunk, and 'curchunk' is the value we actually find in the
> TOAST tuple. This terminology is not easy to understand. Most messages
> now report 'curchunk', but some still report 'chunkno'. Why does
> 'chunkno' need to exist at all? AFAICS the combination of 'curchunk'
> and 'tctx->last_chunk_seen' ought to be sufficient. I can see no
> particular reason why what you're calling 'chunkno' needs to exist
> even as a local variable, let alone be printed out. Either we haven't
> yet validated that the chunk_id extracted from the tuple is non-null
> and greater than the last chunk number we saw, in which case we can
> just complain about it if we find it to be otherwise, or we have
> already done that validation, in which case we should complain about
> that value and not 'chunkno' in any subsequent messages.

If we use tctx->last_chunk_seen as you propose, I imagine we'd set that to -1 prior to the first call to check_toast_tuple(). In the first call, we'd extract the toast chunk_seq and store it in curchunk and verify that it's one greater than tctx->last_chunk_seen. That all seems fine.

But under corrupt conditions, curchunk = DatumGetInt32(fastgetattr(toasttup, 2, hctx->toast_rel->rd_att, &isnull)) could return -1. That's invalid, of course, but now we don't know what to do. We're supposed to complain when we get the same chunk_seq from the index scan more than once in a row, but we don't know if the value in last_chunk_seen is a real value or just the dummy initial value. Worse still, when we get the next toast tuple back and it has a chunk_seq of -2, we want to complain that the index is returning tuples in reverse order, but we can't, because we still don't know if the -1 in last_chunk_seen is legitimate or a dummy value because that state information isn't carried over from the previous call.

Using chunkno solves this problem. If chunkno == 0, it means this is our first call, and tctx->last_chunk_seen is uninitialized. Otherwise, this is not the first call, and tctx->last_chunk_seen really is the chunk_seq seen in the prior call. There is no ambiguity.

I could probably change "int chunkno" to "bool is_first_call" or similar. I had previously used chunkno in the corruption report about chunks whose chunk_seq is null. The idea was that if you have 100 chunks and the 30th chunk is corruptly nulled out, you could say something like "toast value 178337 has toast chunk 30 with null sequence number", but you had me change that to "toast value 178337 has toast chunk with null sequence number", so generation of that message no longer needs the chunkno. I had kept chunkno around for the other purpose of knowing whether tctx->last_chunk_seen has been initialized yet, but a bool for that would now be sufficient. In any event, though you disagree with me about this below, I think the caller of this code still needs to track chunkno.

> The conditionals between where you set max_valid_prior_chunk and where
> you set last_chunk_seen seem hard to understand, particularly the
> bifurcated way that missing chunks are reported. Initial missing
> chunks are detected by (chunkno == 0 && max_valid_prior_chunk >= 0)
> and later missing chunks are detected by (chunkno > 0 &&
> max_valid_prior_chunk > tctx->last_chunk_seen). I'm not sure if this
> is correct;

When we read a chunk_seq from a toast tuple, we need to determine if it indicates a gap in the chunk sequence, but we need to be careful.

The (chunkno == 0) and (chunkno > 0) stuff is just distinguishing between the first call and all subsequent calls.

For illustrative purposes, imagine that we expect chunks [0..4].

On the first call, we expect chunk_seq = 0, but that's not what we actually complain about if we get chunk_seq = 15. We complain about all missing expected chunks, namely [0..4], not [0..14]. We also don't complain yet about seeing extraneous chunk 15, because it might be the first in a series of contiguous extraneous chunks, and we want to wait and report those all at once when the sequence finishes. Simply complaining at this point that we didn't expect to see chunk_seq 15 is the kind of behavior that we already have committed and are trying to fix because the corruption reports are not on point.

On subsequent calls, we expect chunk_seq = last_chunk_seen+1, but that's also not what we actually complain about if we get some other value for chunk_seq. What we complain about are the missing and extraneous sequences, not the individual chunk that had an unexpected value.

> I find it hard to get my head around what
> max_valid_prior_chunk is supposed to represent. But in any case I
> think it can be written more simply. Just keep track of what chunk_id
> we expect to extract from the next TOAST tuple. Initially it's 0.
> Then:
>
> if (chunkno < tctx->expected_chunkno)
> {
> // toast value %u index scan returned chunk number %d when chunk %d
> was expected
> // don't modify tctx->expected_chunkno here, just hope the next
> thing matches our previous expectation
> }
> else
> {
> if (chunkno > tctx->expected_chunkno)
> // chunks are missing from tctx->expected_chunkno through
> Min(chunkno - 1, tctx->final_expected_chunk), provided that the latter
> value is greater than or equal to the former
> tctx->expected_chunkno = chunkno + 1;
> }
>
> If you do this, you only need to report extraneous chunks when chunkno
>> tctx->final_expected_chunk, since chunkno < 0 is guaranteed to
> trigger the first of the two complaints shown above.

In the example above, if we're expecting chunks [0..4] and get chunk_seq = 5, the max_valid_prior_chunk is 4. If we instead get chunk_seq = 6, the max_valid_prior_chunk is still 4, because chunk 5 is out of bounds.

> In check_tuple_attribute I suggest "bool valid = false" rather than
> "bool invalid = true". I think it's easier to understand.

Yeah, I had it that way and changed it, because I don't much like having the only use of a boolean be a negation.

bool foo = false; ... if (!foo) { ... }

seems worse to me than

bool foo = true; ... if (foo) { ... }

But you're looking at it more from the perspective of english grammar, where "invalid = false" reads as a double-negative. That's fine. I can change it back.

> I object to check_toasted_attribute() using 'chunkno' in a message for
> the same reasons as above in regards to check_toast_tuple() i.e. I
> think it's a concept which should not exist.

So if we expect 100 chunks, get chunks [0..19, 80..99], you'd have me write the message as "expected 100 chunks but sequence ended at chunk 99"? I think that's odd. It makes infinitely more sense to me to say "expected 100 chunks but sequence ended at chunk 40". Actually, this is an argument against changing "int chunkno" to "bool is_first_call", as I alluded to above, because we have to keep the chunkno around anyway.

> I think this patch could possibly be split up into multiple patches.
> There's some question in my mind whether it's getting too late to
> commit any of this, since some of it looks suspiciously like new
> features after feature freeze. However, I kind of hate to ship this
> release without at least doing something about the chunkno vs.
> curchunk stuff, which is even worse in the committed code than in your
> patch, and which I think will confuse the heck out of users if those
> messages actually fire for anyone.

I'm in favor of cleaning up the committed code to have easier to understand output. I don't really agree with any of your proposed changes to my patch, though, which is I think a first.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2021-04-20 00:09:17 Re: partial heap only tuples
Previous Message Thomas Munro 2021-04-20 00:05:27 Re: Bogus collation version recording in recordMultipleDependencies