Re: pg_amcheck contrib application

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-04-08 21:21:28
Message-ID: EA725232-2D5E-46CF-AC37-B702E2286454@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Apr 8, 2021, at 1:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Apr 8, 2021 at 3:02 PM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> Imagine a toasted attribute with 18 chunks numbered [0..17]. Then we update the toast to have only 6 chunks numbered [0..5] except we corruptly keep chunks numbered [12..17] in the toast table. We'd rather see a report like this:
> [ toast value NNN chunk NNN has sequence number NNN, but expected
> sequence number NNN ]
>> than one like this:
> [ toast value NNN contains chunk NNN where chunk NNN was expected ]
>>
>> because saying the toast value ended at "chunk 12" after saying that it contains "chunk 17" is contradictory. You need the distinction between the chunk number and the chunk sequence number, since in corrupt circumstances they may not be the same.
>
> Hmm, I see your point, and that's a good example to illustrate it.
> But, with that example in front of me, I am rather doubtful that
> either of these is what users actually want. Consider the case where I
> should have chunks 0..17 and chunk 1 is just plain gone. This, by the
> way, seems like a pretty likely case to arise in practice, since all
> we need is for a block to get truncated away or zeroed erroneously, or
> for a tuple to get pruned that shouldn't. With either of the above
> schemes, I guess we're going to get a message about every chunk from 2
> to 17, complaining that they're all misnumbered. We might also get a
> complaint that the last chunk is the wrong size, and that the total
> number of chunks isn't right. What we really want is a single
> complaint saying chunk 1 is missing.
>
> Likewise, in your example, I sort of feel like what I really want,
> rather than either of the above outputs, is to get some messages like
> this:
>
> toast value NNN contains unexpected extra chunk [12-17]
>
> Both your phrasing for those messages and what I suggested make it
> sound like the problem is that the chunk number is wrong. But that
> doesn't seem like it's taking the right view of the situation. Chunks
> 12-17 shouldn't exist at all, and if they do, we should say that, e.g.
> by complaining about something like "toast value 16444 chunk 12
> follows last expected chunk 5"
>
> In other words, I don't buy the idea that the user will accept the
> idea that there's a chunk number and a chunk sequence number, and that
> they should know the difference between those things and what each of
> them are. They're entitled to imagine that there's just one thing, and
> that we're going to tell them about value that are extra or missing.
> The fact that we're not doing that seems like it's just a matter of
> missing code.

Somehow, we have to get enough information about chunk_seq discontinuity into the output that if the user forwards it to -hackers, or if the output comes from a buildfarm critter, that we have all the information to help diagnose what went wrong.

As a specific example, if the va_rawsize suggests 2 chunks, and we find 150 chunks all with contiguous chunk_seq values, that is different from a debugging point of view than if we find 150 chunks with chunk_seq values spread all over the [0..MAXINT] range. We can't just tell the user that there were 148 extra chunks. We also shouldn't phrase the error in terms of "extra chunks", since it might be the va_rawsize that is corrupt.

I agree that the current message output might be overly verbose in how it reports this information. Conceptually, we want to store up information about the chunk issues and report them all at once, but that's hard to do in general, as the number of chunk_seq discontinuities might be quite large, much too large to fit reasonably into any one message. Maybe we could report just the first N discontinuities rather than all of them, but if somebody wants to open a hex editor and walk through the toast table, they won't appreciate having the corruption information truncated like that.

All this leads me to believe that we should report the following:

1) If the total number of chunks retrieved differs from the expected number, report how many we expected vs. how many we got
2) If the chunk_seq numbers are discontiguous, report each discontiguity.
3) If the index scan returned chunks out of chunk_seq order, report that
4) If any chunk is not the expected size, report that

So, for your example of chunk 1 missing from chunks [0..17], we'd report that we got one fewer chunks than we expected, that the second chunk seen was discontiguous from the first chunk seen, that the final chunk seen was smaller than expected by M bytes, and that the total size was smaller than we expected by N bytes. The third of those is somewhat misleading, since the final chunk was presumably the right size; we just weren't expecting to hit a partial chunk quite yet. But I don't see how to make that better in the general case.

> If we start the index scan and get chunk 4, we can
> easily emit messages for chunks 0..3 right on the spot, declaring them
> missing. Things do get a bit hairy if the index scan returns values
> out of order: what if it gives us chunk_seq = 2 and then chunk_seq =
> 1? But I think we could handle that by just issuing a complaint in any
> such case that "toast index returns chunks out of order for toast
> value NNN" and stopping further checking of that toast value.
>
>> Purely as manual testing, and not part of the patch, I hacked the backend a bit to allow direct modification of the toast table. After corrupting the toast with the following bit of SQL:
>>
>> WITH chunk_limit AS (
>> SELECT chunk_id, MAX(chunk_seq) AS maxseq
>> FROM $toastname
>> GROUP BY chunk_id)
>> INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
>> (SELECT t.chunk_id,
>> t.chunk_seq + cl.maxseq + CASE WHEN t.chunk_seq < 3 THEN 1 ELSE 7 END,
>> t.chunk_data
>> FROM $toastname t
>> INNER JOIN chunk_limit cl
>> ON t.chunk_id = cl.chunk_id)
>>
>> pg_amcheck reports the following corruption messages:
>>
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 chunk 6 follows last expected chunk 5
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 chunk 7 follows last expected chunk 5
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 chunk 8 follows last expected chunk 5
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 chunk 9 has sequence number 15, but expected sequence number 9
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 chunk 10 has sequence number 16, but expected sequence number 10
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 chunk 11 has sequence number 17, but expected sequence number 11
>> # heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
>> # toast value 16444 was expected to end at chunk 6, but ended at chunk 12
>>
>> I think if we'd left out the first three messages, it would read strangely. We would be complaining about three chunks with the wrong sequence number, then conclude that there were six extra chunks. A sufficiently savvy user might deduce the presence of chunks 6, 7, and 8, but the problem is more obvious (to my eyes, at least) if we keep the first three messages. This seems like a judgement call and not a clear argument either way, so if you still want me to change it, I guess I don't mind doing so.
>
> I mean, looking at it, the question here is why it's not just using
> the same message for all of them. The fact that the chunk numbers are
> higher than 5 is the problem. The sequence numbers seem like just a
> distraction.

Again, I don't think we can reach that conclusion. You are biasing the corruption reports in favor of believing the va_rawsize rather than believing the toast table.

>>> On a related note, as I think I said before, I still think we should
>>> be rejiggering this so that we're not testing both the size of each
>>> individual chunk and the total size, because that ought to be
>>> redundant. That might be better done as a separate patch but I think
>>> we should try to clean it up.
>>
>> Can you point me to the exact check you are mentioning, and with which patch applied? I don't see any examples of this after applying the v18-0003.
>
> Hmm, my mistake, I think.
>
>>> I'm a little worried about whether the additional test cases are
>>> Endian-dependent at all. I don't immediately know what might be wrong
>>> with them, but I'm going to think about that some more later. Any
>>> chance you have access to a Big-endian box where you can test this?
>>
>> I don't have a Big-endian box, but I think one of them may be wrong now that you mention the issue:
>>
>> # Corrupt column c's toast pointer va_extinfo field
>>
>> The problem is that the 30-bit extsize and 2-bit cmid split is not being handled in the perl test, and I don't see an easy way to have perl's pack/unpack do that for us. There isn't any requirement that each possible corruption we check actually be manifested in the regression tests. The simplest solution is to remove this problematic test, so that's what I did. The other two new tests corrupt c_va_toastrelid and c_va_rawsize, both of which are read/written using unpack/pack, so perl should handle the endianness for us (I hope).
>
> I don't immediately see why this particular thing should be an issue.
> The format of the varlena header itself is different on big-endian and
> little-endian machines, which is why postgres.h has all this stuff
> conditioned on WORDS_BIGENDIAN. But va_extinfo doesn't have any
> similar treatment, so I'm not sure what could go wrong there, as long
> as the 4-byte value as a whole is being packed and unpacked according
> to the machine's endian-ness.

Good point. Perhaps the test was ok after all.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2021-04-08 21:29:51 Re: pgsql: autovacuum: handle analyze for partitioned tables
Previous Message Tom Lane 2021-04-08 20:57:59 Re: Dubious coding in nbtinsert.c