Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject: Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
Date: 2019-08-26 01:44:14
Message-ID: 20190826014414.GC7201@telsasoft.com
Lists: pgsql-hackers

On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > I was reminded of this issue from last year, which also appeared to
> > involve BufFileClose() and a double-free:
> >
> > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
> >
> > That was a BufFile that was under the control of a tuplestore, so it
> > was similar to but different from your case. I suspect it's related.
>
> Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> it always nukes its pointer after calling BufFileClose(), so it
> shouldn't be capable of calling it twice for the same pointer, unless
> we have two copies of that pointer somehow.
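
For illustration only, the close-then-clear pattern described above can be sketched as follows. This is not PostgreSQL source: FakeFile, fake_close() and close_and_clear() are hypothetical stand-ins, and the "close" only counts calls instead of freeing, so the sketch never actually double-frees. It shows why clearing the pointer protects a single owner but not a second copy of the same pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BufFile; not the PostgreSQL struct. */
typedef struct FakeFile
{
    int close_count;    /* how many times fake_close() saw this object */
} FakeFile;

/*
 * Hypothetical stand-in for a close routine.  A real close would free the
 * object, so a second call on the same object would be a double free.
 * Here we only count the calls so the demo stays well-defined.
 */
static void
fake_close(FakeFile *f)
{
    f->close_count++;
}

/* Close the file (if any) and clear the owner's pointer to it. */
static void
close_and_clear(FakeFile **slot)
{
    if (*slot != NULL)
    {
        fake_close(*slot);
        *slot = NULL;
    }
}

/* One owner: a repeated cleanup call finds NULL and does nothing. */
static int
single_owner_demo(void)
{
    FakeFile    file = {0};
    FakeFile   *slot = &file;

    close_and_clear(&slot);
    close_and_clear(&slot);     /* no-op: slot is already NULL */
    return file.close_count;
}

/* Two copies of the pointer: clearing one copy leaves the other live. */
static int
two_copies_demo(void)
{
    FakeFile    file = {0};
    FakeFile   *slot_a = &file;
    FakeFile   *slot_b = slot_a;    /* second copy of the same pointer */

    close_and_clear(&slot_a);
    close_and_clear(&slot_b);   /* closes the same object a second time */
    return file.close_count;
}
```

With a single owner the object is closed once; with two copies of the pointer it is closed twice, which with a real free() underneath would be the double free glibc complains about.
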
>
> Merlin's reported a double-free apparently in ExecHashJoin(), not
> ExecHashJoinNewBatch() like this report. Unfortunately that tells us
> very little.
>
> On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
> > #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058
>
> Can you reproduce this or was it a one-off crash?

The query was one of our large reports, and this job runs every 15min against
recently-loaded data; in the immediate case, between
2019-08-24 08:00:00 and 2019-08-24 09:00:00.

I can rerun it fine, and I ran it in a loop for a while last night with no
issues.

time psql ts -f tmp/sql-2019-08-24.1 |wc
5416 779356 9793941

Since it was asked in the other thread Peter mentioned:

ts=# SHOW work_mem;
work_mem | 128MB

ts=# SHOW shared_buffers ;
shared_buffers | 1536MB

> might be some obscure path somewhere, possibly through a custom
> operator or suchlike, that leaves us in a strange memory context, or
> something like that? But then I feel like we'd have received
> reproducible reports and a test case by now.

No custom operator in sight. Just NATURAL JOIN on integers, and WHERE on
timestamp, some plpgsql and int[].

Justin
