Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From: Dmitry Astapov <dastapov(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17619: AllocSizeIsValid violation in parallel hash join
Date: 2022-09-22 12:51:27
Message-ID: CAFQUnFiiPUOWihzjH7gevyaGGnX3tskr_3YFtNhv-AzpzY-jrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Sep 22, 2022 at 9:44 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Thu, Sep 22, 2022 at 7:46 PM PG Bug reporting form
> <noreply(at)postgresql(dot)org> wrote:
> > (gdb) p size
> > $2 = 1702125924
>
> Thanks for the detailed report. Hmm. That size, on a little-endian
> system, is equivalent to the byte sequence "date\0\0\0\0", which looks
> pretty suspiciously like the inside of a tuple, and not its size. We
> must have got out of sync somehow.
>

Great catch! I added some debugging output to the BufFileRead and
sts_read_tuple and it looks to me that the sequence of events immediately
before the problem happens is like this:

1. sts_read_tuple is called multiple times to read from a particular
file, and a bunch of tuples are read this way
2. sts_read_tuple is called once more. In my experiments, I always had
read_ntuples_available - 2 == read_ntuples at this point, but maybe this is
just a coincidence.
1. inside this sts_read_tuple_call, BufFileRead is called to read
metadata, and succeeds (and this chunk of the file looks -- to
my naive eye
-- exactly like any other tuple start)
2. Second BufFileRead call happens to get the size, and reads the
correct size (I see what's in the file, I could find corresponding values
in the database tables, and it all lines up).
3. Third BufFileRead happens to read the actual tuple. So far so
good, and sts_read_tuple finishes.
3. At this point, the position in the file (buf_file -> currOffset +
post) points right after the last tuple read, but what's there does not
look like the start of the next tuple to me. It seems like there was
another, longer tuple written there before (with quite long rhs.payload
value) and somehow the tuple that was just read was written over it, so we
have "<tuple read on step 2 above><current read position here><what looks
like leftovers from another tuple><a bunch of zeroes><new tuple start>"
4. So another sts_read_tuple call happens, and it reads 4 bytes of this
garbage as metadata, and the next 4 bytes as length, and this is when
everything derails...

I also double-checked at the logs of the (test) database where this issue
was first spotted (and where a lot of queries similar to the one I am using
to trigger this issue are happening) and found out that along with "invalid
memory alloc request size" we also got a couple of occasions of "Unexpected
chunk in shared tuplestore temporary file. Expected overflow chunk", which
looks like another error that could be potentially explained by
BufFileRead's inside sts_read_tuple reading "garbage".

I'm happy to build a patched version of 13.8 (or another revision if you
want me to) to collect more debug information for you if this could help.

--
D. Astapov

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2022-09-22 15:52:07 BUG #17620: JIT-Compiler causes Segmentation Fault
Previous Message Thomas Munro 2022-09-22 08:44:01 Re: BUG #17619: AllocSizeIsValid violation in parallel hash join