Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Astapov <dastapov(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17619: AllocSizeIsValid violation in parallel hash join
Date: 2022-09-27 19:15:19
Message-ID: CA+hUKGJV54w8jVqdBcpP7LaCL8PhcEhT97-nfrTcD2rdKCcteA@mail.gmail.com
Lists: pgsql-bugs

On Wed, Sep 28, 2022 at 7:33 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Sep 27, 2022 at 9:44 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Right, the missing piece is the intentional clobber.
>
> That does seem like the best place to start. The attached patch adds
> clobbering that works exactly as you'd expect. This approach is
> obviously correct. It also doesn't require any reasoning about
> Valgrind's treatment of memory mappings for shared memory, which is
> quite complicated given the inconsistent rules about who initializes
> what memory (if it's leader or workers).
>
> I find that the tests pass with this patch -- so it probably won't
> catch the bug that Thomas mentioned via running the tests (at least
> not reliably). However, if I revert parallel VACUUM bugfix commit
> 662ba729 and then run the tests, they fail very reliably, in several
> places. That seems like a big improvement.

The reason it doesn't catch that bug on master is that the npages
shmem variable is only used to prevent further reading once a scan
hits the end of a shared tuplestore chunk and needs to decide whether
to read a new one. If a chunk is only partially filled, we end the
scan sooner, because there's a number-of-items counter in the chunk
header. I noticed because the test module I wrote to study Dmitry's
report fills chunks exactly to the end, so I assume the clobber patch
+ that test module patch would reveal the problem.
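
To make the distinction concrete, here is a toy model of the behaviour
described above. It is only a sketch and not the actual shared
tuplestore code: toy_scan(), TOY_CHUNK_CAPACITY, the outcome enum and
the 0x7F clobber byte are all hypothetical names and values chosen for
illustration. It assumes that a chunk whose item counter is below
capacity ends the scan on its own, and that npages is consulted only
when a chunk is filled exactly to the end.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CHUNK_CAPACITY 4    /* hypothetical: items that fit in one chunk */
#define TOY_CLOBBER_BYTE   0x7F /* assumed clobber pattern, for illustration */

typedef enum
{
    ENDED_BY_ITEM_COUNTER,      /* partial chunk: header counter ended the scan */
    ENDED_BY_NPAGES,            /* full chunk: npages correctly ended the scan */
    READ_PAST_END               /* full chunk: bogus npages let the scan run on */
} ToyScanOutcome;

/* What an uninitialized uint32 looks like after clobbering. */
static uint32_t
toy_clobbered_u32(void)
{
    uint32_t v;

    memset(&v, TOY_CLOBBER_BYTE, sizeof(v));
    return v;
}

/*
 * Scan a file holding real_chunks chunks, the last of which contains
 * last_chunk_items items.  shared_npages is the value the reader sees
 * in shared memory, which may or may not have been initialized.
 */
static ToyScanOutcome
toy_scan(uint32_t real_chunks, uint32_t last_chunk_items,
         uint32_t shared_npages)
{
    assert(real_chunks > 0);
    assert(last_chunk_items <= TOY_CHUNK_CAPACITY);

    for (uint32_t page = 0;; page++)
    {
        uint32_t nitems;

        if (page >= real_chunks)
            return READ_PAST_END;   /* nothing stopped us in time */

        nitems = (page == real_chunks - 1) ? last_chunk_items
                                           : TOY_CHUNK_CAPACITY;

        /* A partially filled chunk ends the scan without using npages. */
        if (nitems < TOY_CHUNK_CAPACITY)
            return ENDED_BY_ITEM_COUNTER;

        /* Chunk filled exactly to the end: only now is npages consulted. */
        if (page + 1 >= shared_npages)
            return ENDED_BY_NPAGES;
    }
}
```

In this model, a clobbered npages goes unnoticed whenever the last
chunk is partial, and shows up only when the data fills the last chunk
exactly, which is the case the test module exercises.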

I was assuming it didn't break the case you mentioned because that's
just stats counters (maybe those finish up wrong but that's probably
not a failure), but now it sounds like you've seen another reason.

> I believe that Thomas was going to do something like this anyway. I'm
> happy to leave it up to him, but I can pursue this separately if that
> makes sense.

Why not clobber "lower down" in dsm_create(), as I showed? You don't
have to use the table-of-contents mechanism to use DSM memory.
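
As a rough illustration of what clobbering "lower down" means, the
sketch below fills a whole segment with a recognizable byte pattern at
creation time, before any caller carves it up. It is a standalone
model under stated assumptions, not the real dsm_create():
toy_segment_create(), ToySharedState and the 0x7F pattern are
hypothetical, and plain calloc() stands in for a freshly mapped (and
therefore zeroed) shared memory segment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CLOBBER_BYTE 0x7F   /* assumed pattern, chosen for illustration */

/* Hypothetical shared state that some caller places in the segment. */
typedef struct ToySharedState
{
    uint32_t npages;            /* must be initialized by the creator */
    uint32_t nparticipants;
} ToySharedState;

/*
 * Create a segment and clobber all of it.  Fresh mappings are normally
 * zero-filled, which can hide a missing initialization; overwriting the
 * whole segment here covers every user of the memory, whether or not
 * they go through a table-of-contents layer on top.
 */
static void *
toy_segment_create(size_t size)
{
    void *seg = calloc(1, size);    /* stand-in for a zeroed mapping */

    if (seg == NULL)
        return NULL;

    memset(seg, TOY_CLOBBER_BYTE, size);
    return seg;
}
```

With this in place, a caller that forgets to set npages reads
0x7F7F7F7F instead of a plausible-looking zero, so the mistake is far
more likely to cause a visible failure.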
