Re: infinite loop in parallel hash joins / DSA / get_best_segment

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: infinite loop in parallel hash joins / DSA / get_best_segment
Date: 2018-09-16 22:42:34
Message-ID: CAEepm=2R24dengvkjWw7a=c2pDvEkAXSH5q0=nrFZpw1gkj50Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
> repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
> exactly what the triggering conditions are, but the symptoms are these:
>
> 1) A "parallel worker" process is consuming 100% CPU, with perf
> reporting a profile like this:
>
> 34.66% postgres [.] get_segment_by_index
> 29.44% postgres [.] get_best_segment
> 29.22% postgres [.] unlink_segment.isra.2
> 6.66% postgres [.] fls
> 0.02% [unknown] [k] 0xffffffffb10014b0
>
> So all the time seems to be spent within get_best_segment.
>
> 2) The backtrace looks like this (full backtrace attached):
>
> #0 0x0000561a748c4f89 in get_segment_by_index
> #1 0x0000561a748c5653 in get_best_segment
> #2 0x0000561a748c67a9 in dsa_allocate_extended
> #3 0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
> #4 0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
> #5 0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
> #6 ExecHashJoinImpl
> #7 ExecParallelHashJoin
> #8 ExecProcNode
> ...
>
> 3) The infinite loop seems to be pretty obvious - after setting a
> breakpoint on get_segment_by_index we get this:
>
> Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
> (gdb) c
> Continuing.
>
> Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
> (gdb) c
> Continuing.
>
> Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
> (gdb) c
> Continuing.
>
> That is, we call the function with the same index over and over.
>
> Why is that? Well:
>
> (gdb) print *area->segment_maps[3].header
> $1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
> next = 3, bin = 0, freed = false}
>
> So segment 3 links to itself (prev = 3, next = 3), and we loop forever.
>
> I don't know exactly what the triggering conditions are. I've only
> ever observed the issue on TPC-H query #4 at scale 16GB, with a
> partitioned lineitem table and work_mem set to 8MB. And it seems I can
> reproduce it pretty reliably.
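
For illustration only: a minimal sketch of why a header with prev = 3 and
next = 3 on segment 3 can never let a list walk finish. This is not the
PostgreSQL DSA code. The names toy_segment, toy_walk_terminates and
NO_SEGMENT are hypothetical, and the sketch assumes the list is linked by
array index and ends at a sentinel value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "end of list" marker; an assumption of this sketch. */
#define NO_SEGMENT ((size_t) -1)

/* Hypothetical, simplified stand-in for a segment header's list links. */
typedef struct
{
    size_t prev;    /* index of the previous segment, or NO_SEGMENT */
    size_t next;    /* index of the next segment, or NO_SEGMENT */
} toy_segment;

/*
 * Follow 'next' links starting at 'head', taking at most 'max_steps' hops.
 * Returns true if the walk reached NO_SEGMENT, false if it gave up first,
 * which means the links form a cycle (or the list is longer than max_steps).
 */
static bool
toy_walk_terminates(const toy_segment *segs, size_t head, size_t max_steps)
{
    size_t idx = head;
    size_t steps;

    for (steps = 0; steps < max_steps; steps++)
    {
        if (idx == NO_SEGMENT)
            return true;
        idx = segs[idx].next;   /* with next == idx this never advances */
    }
    return idx == NO_SEGMENT;
}
```

With segs[3].next set to 3, a walk starting at index 3 revisits the same
index on every hop and the function returns false. An unbounded loop over
the same links would spin forever, which matches the repeated breakpoint
hits with index=3 shown above.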

Urgh. Thanks Tomas. I will investigate.

--
Thomas Munro
http://www.enterprisedb.com
