Re: TPC-H Q20 from 1 hour to 19 hours!

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TPC-H Q20 from 1 hour to 19 hours!
Date: 2017-03-29 22:14:30
Message-ID: 060a9d76-67ab-0152-ff2e-36bf9156fd78@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/29/2017 09:00 PM, Robert Haas wrote:
> On Mon, Mar 6, 2017 at 1:22 AM, Rafia Sabih
> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>> This is to bring to notice a peculiar instance I found recently while
>> running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
>> complete ...
>
> That's bad.
>
>> It is clear that selectivity estimations are really bad in this case
>> particularly at node,
>> -> Merge Join (cost=52959586.72..60024468.82 rows=85 width=16)
>> (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
>> Merge Cond: ((lineitem.l_partkey =
>> partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>> Join Filter:
>> ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
>> Rows Removed by Join Filter: 3771
>
> So, the selectivity estimation here is bad both before and after Tom's
> commit, but it's significantly worse after (actual value 1696742, old
> estimate 3771, new estimate 85).
>
>> Still this puzzled me as during earlier runs of this benchmark I never
>> encountered such prolonged running times. On further investigation I
>> found that on reverting the commit
>> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
>> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Date: Sat Dec 17 15:28:54 2016 -0500
>> Fix FK-based join selectivity estimation for semi/antijoins.
>
> I don't think the problem originates at the Merge Join, though,
> because the commit says that at is fixing semi and anti-join estimates
> - this is a plain inner join, so in theory it shouldn't change.
> However, it's a bit hard for me to piece through these plans, the
> formatting kind of got messed up - things are wrapped. Could you
> possibly attach the plans as attachments?
>

I've been looking into this today, and it seems to me the simplest query
triggering this issue (essentially a part of q20) is this:

select
ps_suppkey
from
partsupp,
(
select
l_partkey agg_partkey,
l_suppkey agg_suppkey
from
lineitem
group by
l_partkey,
l_suppkey
) agg_lineitem
where
agg_partkey = ps_partkey
and agg_suppkey = ps_suppkey
and ps_partkey in (
select
p_partkey
from
part
where
p_name like 'hot%'
);

which does actually include a semijoin. What seems to trigger the issue
is the join to the aggregated lineitem table - when replacing it with a
plain table, everything seems to be estimated perfectly fine.

Attached is a simple SQL script, that runs three variants of the query:

(1) with the join to the aggregated lineitem table
(2) with a join to a plain lineitem table
(3) with a join to a plain lineitem table and to 'part' (without the
semijoin)

First the queries are executed on tables without any foreign keys
(between those three), then with a FK between lineitem and partsupp, and
finally with additional FK between partsupp and part.

Initially the estimates are bad, but once the first foreign key is
added, the estimates get very accurate - except for the case (1).

I've only ran the queries on 10GB data set, but that should be enough.
The plans are from current master - I'll rerun the script on an older
release later today.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
q20.sql application/sql 3.9 KB
q20.log text/x-log 27.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2017-03-29 23:01:38 Re: logical replication access control patches
Previous Message Jesper Pedersen 2017-03-29 21:06:13 Re: Page Scan Mode in Hash Index