Re: new heapcheck contrib module

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-10-23 14:04:04
Message-ID: 30B8E99A-2D9C-48D4-A55C-741C9D5F1563@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Oct 22, 2020, at 9:21 PM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
>
>
>> On Oct 22, 2020, at 7:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
>>> Ahh, crud. It's because
>>> syswrite($fh, '\x77\x77\x77\x77', 500)
>>> is wrong twice. The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal with backslashes and such. It should have been double-quoted.
>>
>> Argh. So we really have, using same test except
>>
>> memcpy(&lp, "\\x77", sizeof(lp));
>>
>> little endian: off = 785c, flags = 2, len = 1b9b
>> big endian: off = 2e3c, flags = 0, len = 3737
>>
>> which explains the apparent LP_DEAD result.
>>
>> I'm not particularly on board with your suggestion of "well, if it works
>> sometimes then it's okay". Then we have no idea of what we really tested.
>>
>> regards, tom lane
>
> Ok, I've pruned it down to something you may like better. Instead of just checking that *some* corruption occurs, it checks the returned corruption against an expected regex, and if it fails to match, you should see in the logs what you got vs. what you expected.
>
> It only corrupts the first two line pointers, the first one with 0x77777777 and the second one with 0xAAAAAAAA, which are consciously chosen to be bitwise reverses of each other and just strings of alternating bits rather than anything that could have a more complicated interpretation.
>
> On my little-endian mac, the 0x77777777 value creates a line pointer which redirects to an invalid offset 0x7777, which gets reported as decimal 30583 in the corruption report, "line pointer redirection to item at offset 30583 exceeds maximum offset 38". The test is indifferent to whether the corruption it is looking for is reported relative to the first line pointer or the second one, so if endian-ness matters, it may be the 0xAAAAAAAA that results in that corruption message. I don't have a machine handy to test that. It would be nice to determine the minimum amount of paranoia necessary to make this portable and not commit the rest.

Obviously, that should have said 0x55555555 and 0xAAAAAAAA. After writing the patch that way, I checked that the old value 0x77777777 also works on my mac, which it does, and checked that writing the line pointers starting at offset 24 rather than 32 works on my mac, which it does, and then went on to write this rather confused email and attached the patch with those changes, which all work (at least on my mac) but are potentially less portable than what I had before testing those changes.

I apologize for any confusion my email from last night may have caused.

The patch I *should* have attached last night this time:

Attachment Content-Type Size
regress.patch.WIP.2 application/octet-stream 5.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2020-10-23 14:12:11 Re: [doc] remove reference to pg_dump pre-8.1 switch behaviour
Previous Message Jürgen Purtz 2020-10-23 13:58:46 Re: Additional Chapter for Tutorial