Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject: Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
Date: 2019-08-26 01:09:19
Message-ID: CA+hUKGJarB7fmBTZvvsBddN+XXDkKGBQ-yhpcum5Kk1M3dzY1g@mail.gmail.com
Lists: pgsql-hackers

On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I was reminded of this issue from last year, which also appeared to
> involve BufFileClose() and a double-free:
>
> https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
>
> That was a BufFile that was under the control of a tuplestore, so it
> was similar to but different from your case. I suspect it's related.

Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c:
it always nukes its pointer after calling BufFileClose(), so it
shouldn't be capable of calling it twice for the same pointer, unless
we have two copies of that pointer somehow.
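
To illustrate that pattern, here is a minimal sketch in plain C. It is
not PostgreSQL code: ToyFile, toy_open(), toy_close() and
toy_close_and_forget() are hypothetical stand-ins, and malloc()/free()
are used in place of palloc()/pfree().

```c
#include <stdlib.h>

/* Hypothetical stand-in for a BufFile; not the PostgreSQL struct. */
typedef struct ToyFile
{
    int  numFiles;
    int *files;
} ToyFile;

/* Counts how often the real close routine ran, for demonstration only. */
static int toy_close_calls = 0;

ToyFile *
toy_open(void)
{
    ToyFile *f = malloc(sizeof(ToyFile));

    if (f == NULL)
        return NULL;
    f->numFiles = 1;
    f->files = calloc(1, sizeof(int));
    if (f->files == NULL)
    {
        free(f);
        return NULL;
    }
    return f;
}

/* Releases everything; calling this twice on one pointer is a double free. */
void
toy_close(ToyFile *f)
{
    toy_close_calls++;
    free(f->files);
    free(f);
}

/* Close the file, if any, then clear the caller's pointer. */
void
toy_close_and_forget(ToyFile **slot)
{
    if (*slot != NULL)
    {
        toy_close(*slot);
        *slot = NULL;
    }
}
```

Calling toy_close_and_forget(&f) a second time is a no-op because f is
already NULL. The guard only protects that one variable, though: a
second copy of the pointer held elsewhere would still be dangling.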

Merlin's report showed a double-free apparently in ExecHashJoin(), not
ExecHashJoinNewBatch() as in this report. Unfortunately that tells us
very little.

On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
> #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058

Can you reproduce this or was it a one-off crash?

Hmm. We don't have enough stack frames to know exactly where in here
(and presumably in aset.c) the free() happened, but as you noted it's
got to be somewhere here:

void
BufFileClose(BufFile *file)
{
    int         i;

    /* flush any unwritten data */
    BufFileFlush(file);
    /* close and delete the underlying file(s) */
    for (i = 0; i < file->numFiles; i++)
        FileClose(file->files[i]);
    /* release the buffer space */
    pfree(file->files);
    pfree(file->offsets);
    pfree(file);
}

BufFileFlush() and FileClose() don't seem to be able to reach free().
pfree() only reaches free() directly for chunks larger than
allocChunkLimit (8KB in ExecutorState); smaller chunks just go back on
the context's freelist. Given numFiles = 1 (as you showed), I expect
file->files and file->offsets to be small allocations, and file itself
to be large due to the 8KB buffer inside it, so pfree(file) is the most
likely place for glibc to complain.
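
As a rough model of that size threshold, here is a simplified sketch.
It is not aset.c: ToyContext, toy_alloc(), toy_pfree(), toy_reset() and
TOY_CHUNK_LIMIT are hypothetical names, and the real allocator carves
small chunks out of shared blocks rather than calling malloc() for each
one. The only point is which release path hands memory back to the
system allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed threshold, mirroring the 8KB figure mentioned above. */
#define TOY_CHUNK_LIMIT 8192

/* Header stored in front of every allocation. */
typedef struct ToyChunk
{
    size_t           size;
    struct ToyChunk *next_free;
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk *freelist;          /* small chunks kept for reuse */
    int       system_free_calls; /* times toy_pfree() reached free() */
} ToyContext;

void *
toy_alloc(ToyContext *cxt, size_t size)
{
    ToyChunk *c = malloc(sizeof(ToyChunk) + size);

    (void) cxt;                  /* unused in this simplified model */
    if (c == NULL)
        return NULL;
    c->size = size;
    c->next_free = NULL;
    return (void *) (c + 1);
}

void
toy_pfree(ToyContext *cxt, void *p)
{
    ToyChunk *c = ((ToyChunk *) p) - 1;

    if (c->size > TOY_CHUNK_LIMIT)
    {
        /* Large chunk: handed straight back to the system allocator. */
        cxt->system_free_calls++;
        free(c);
    }
    else
    {
        /* Small chunk: parked on the freelist, free() is not called. */
        c->next_free = cxt->freelist;
        cxt->freelist = c;
    }
}

/* Release the parked small chunks, as a context reset would. */
void
toy_reset(ToyContext *cxt)
{
    while (cxt->freelist != NULL)
    {
        ToyChunk *next = cxt->freelist->next_free;

        free(cxt->freelist);
        cxt->freelist = next;
    }
}
```

In this model, releasing a 16-byte chunk leaves system_free_calls at 0,
while releasing a chunk bigger than TOY_CHUNK_LIMIT bumps it to 1. That
is why only a large allocation would be expected to trip glibc's check
at pfree() time.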

Some possibilities:

1. Somehow we actually called BufFileClose() twice. Hard to see how
we'd do that, as mentioned.

2. Somehow the BufFile was created in the wrong memory context, and
the memory was freed earlier. Note that ExecHashJoinSaveTuple()'s
comment requires you to call it with CurrentMemoryContext == the
executor context, but nothing in the code enforces that. As for the
tuplestore.c case, note also that tuplestore.c explicitly sets
CurrentResourceOwner, but not CurrentMemoryContext. I suppose there
might be some obscure path somewhere, possibly through a custom
operator or suchlike, that leaves us in a strange memory context, or
something like that? But then I feel like we'd have received
reproducible reports and a test case by now.

3. Random memory corruption caused by buffer overrun who-knows-where.
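
For possibility 2, this is a toy sketch of what enforcing the
documented precondition could look like. Everything in it is
hypothetical (ToyContext, toy_current, toy_switch_to(),
toy_save_tuple()); it only models "the file is lazily created in
whatever context happens to be current", and is not a proposed patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory context. */
typedef struct ToyContext
{
    const char *name;
} ToyContext;

/* Models CurrentMemoryContext: allocations go wherever this points. */
const ToyContext *toy_current = NULL;

/* Switch the current context and return the previous one. */
const ToyContext *
toy_switch_to(const ToyContext *cxt)
{
    const ToyContext *old = toy_current;

    toy_current = cxt;
    return old;
}

/*
 * Models a save routine that lazily creates its file on first use.
 * *file_owner records the context the file was created in.  Returns
 * false, creating nothing, if the caller is not in the required context.
 */
bool
toy_save_tuple(const ToyContext *required, const ToyContext **file_owner)
{
    if (toy_current != required)
        return false;            /* documented precondition violated */
    if (*file_owner == NULL)
        *file_owner = toy_current;   /* file lazily created here */
    return true;
}
```

Without the check on the first line of toy_save_tuple(), a caller
sitting in some short-lived context would silently get a file owned by
that context, and a later reset of that context followed by an explicit
close would free the same memory twice.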

> glibc-2.12-1.192.el6.x86_64
> linux 2.6.32-754.3.5.el6.x86_64

Greetings, time traveller! I see that the tuplestore.c report was
also on a system of that vintage. Hmm.

> #10 ExecSort (pstate=0x2771108) at nodeSort.c:107
> plannode = <value optimized out>
> outerNode = 0x2771218
> tupDesc = <value optimized out>
> node = 0x2771108
> estate = 0x2770a40
> dir = ForwardScanDirection
> tuplesortstate = 0x3c87160
> slot = <value optimized out>

It's interesting that a sort was involved here and it owns a
tuplesort, which like a tuplestore can spill to a BufFile, but it's
hard to see the connection.

> (gdb) p *innerFile
> $2 = {numFiles = 1, files = 0xa421328, offsets = 0xa421310, isInterXact = false, dirty = false, readOnly = false, fileset = 0x0, name = 0x0, resowner = 0x24f93e0, curFile = 0, curOffset = 73016512,
>
> pos = 0, nbytes = 0, buffer = {
> data = '\000' <repeats 44 times>, "Q\366\262h\220\004\000\000\000\000\000\000\000\000L\000\003\000(\377\377\377\377\377\177\372\377\377\017\000\000\000\000\000\000\000\000\257\321\345\333\063\002\000\003", '\000' <repeats 23 times>"\204, \003\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000b\000\000\000\000\000\000\000c\000\000\000\000\000\000\000c\000\000\000\000\000\000\000d", '\000' <repeats 15 times>, "q\002\000\000\000\000\000\000d", '\000' <repeats 23 times>, "I\000\000\000\000\000\000\000\344H\a\000\000\000\000\000\017\000\210\026\000\310\024\000\024H\000\000\000\000\000\000\017\000\210\035\000H!\027\000\210\216\000T\vB\017\304\t\027\000\210\a\000\206\b\033\030.\"", '\000' <repeats 11 times>..., force_align_d = 0, force_align_i64 = 0}}
>
> (gdb) p innerFile->files[0]
> $8 = 2397

Hmm. That all looks pretty sane from here, but tells us nothing about
whether it was already freed and if so where.

--
Thomas Munro
https://enterprisedb.com
