infinite loop in parallel hash joins / DSA / get_best_segment

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: infinite loop in parallel hash joins / DSA / get_best_segment
Date: 2018-09-16 22:38:10
Message-ID: 194c0706-c65b-7d81-ab32-2c248c3e2344@2ndquadrant.com
Lists: pgsql-hackers

Hi,

While running some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

1) A "parallel worker" process is consuming 100% CPU, with perf
reporting a profile like this:

34.66% postgres [.] get_segment_by_index
29.44% postgres [.] get_best_segment
29.22% postgres [.] unlink_segment.isra.2
6.66% postgres [.] fls
0.02% [unknown] [k] 0xffffffffb10014b0

So all the time seems to be spent within get_best_segment.

2) The backtrace looks like this (full backtrace attached):

#0 0x0000561a748c4f89 in get_segment_by_index
#1 0x0000561a748c5653 in get_best_segment
#2 0x0000561a748c67a9 in dsa_allocate_extended
#3 0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
#4 0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
#5 0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
#6 ExecHashJoinImpl
#7 ExecParallelHashJoin
#8 ExecProcNode
...

3) The infinite loop seems to be pretty obvious - after setting a
breakpoint on get_segment_by_index we get this:

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

That is, we call the function with the same index over and over.

Why is that? Well:

(gdb) print *area->segment_maps[3].header
$1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
next = 3, bin = 0, freed = false}

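So segment 3 has prev = next = 3, i.e. it is linked to itself, and any
walk that follows the next link from it never reaches the end of the
list. We loop forever.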
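
To illustrate why a header with prev = next = 3 is fatal for a list walk,
here is a minimal C sketch. It is not the dsa.c code: toy_segment,
TOY_SEGMENT_NONE and toy_list_terminates are hypothetical names made up for
this example, and the model assumes only that segments are chained by index
and that the walk stops at a "none" marker. The walk is bounded by the
number of segments, so a cycle is reported instead of spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "no segment" marker; the name is illustrative only. */
#define TOY_SEGMENT_NONE ((size_t) -1)

/* Hypothetical cut-down segment header: just the list links. */
typedef struct
{
	size_t		prev;
	size_t		next;
} toy_segment;

/*
 * Follow 'next' links starting at 'head'.
 *
 * Returns true if the walk reaches TOY_SEGMENT_NONE.  Returns false if
 * more than 'nsegments' links had to be followed, which is only possible
 * when the links form a cycle (for example a segment whose next link is
 * its own index).
 */
bool
toy_list_terminates(const toy_segment *segs, size_t nsegments, size_t head)
{
	size_t		index = head;
	size_t		steps = 0;

	while (index != TOY_SEGMENT_NONE)
	{
		assert(index < nsegments);

		/* An acyclic list can visit at most 'nsegments' entries. */
		if (steps++ >= nsegments)
			return false;

		index = segs[index].next;
	}

	return true;
}
```

With a well-formed list 0 -> 1 -> end the function returns true. With a
segment at index 3 whose next is 3, as in the header printed above, it
returns false; an unbounded loop of the same shape would never exit.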

I don't know exactly what the triggering conditions are. I've only ever
observed the issue on TPC-H at scale 16GB, with a partitioned lineitem
table, work_mem set to 8MB, and query #4. With that setup I can
reproduce it pretty reliably.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment      Content-Type   Size
explain.log     text/x-log     19.6 KB
backtrace.txt   text/plain     8.4 KB
