From: Jinbao Chen <jinchen(at)pivotal(dot)io>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Planner chose a much slower plan in hashjoin, using a large table as the inner table.
Date: 2019-11-18 06:48:17
Message-ID: CACLUs_hGEDPyciO6Y-js2tYVo7_qNcWajvLKBKEh=6VT+xvhsw@mail.gmail.com
Lists: pgsql-hackers

Hi Hackers,

The planner will use the big table as the inner table in a hash join if the
small table has few distinct values. But this plan is much slower than using
the small table as the inner table. The problem occurs on the master branch
with parallel scan disabled.

For example:

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000) i;
insert into t_big select i%100000 from generate_series(1, 100000000) i;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;
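
(For reference, the statistics that drive the planner's choice can be
inspected through the standard pg_stats view; the query below is just an
optional sanity check.)

select tablename, attname, n_distinct
from pg_stats
where tablename in ('t_small', 't_big');
-- Expect n_distinct ≈ 100 for t_small.a and ≈ 100000 for t_big.b
-- (up to ANALYZE sampling error).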

The plan chosen by the planner is:

demo2=# explain select * from t_small, t_big where a = b;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=3083104.72..3508073.65 rows=3045990 width=8)
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)
   ->  Hash  (cost=1442478.32..1442478.32 rows=100000032 width=4)
         ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=100000032 width=4)

It runs in nearly 58 seconds:
demo2=# select * from t_small, t_big where a = b;
Time: 58544.525 ms (00:58.545)

But if we apply a hack so that the small table is used as the inner table,
it runs in about 19 seconds:

demo2=# explain select * from t_small, t_big where a = b;
                              QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=81.52..1723019.82 rows=3045990 width=8)
   Hash Cond: (t_big.b = t_small.a)
   ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=100000032 width=4)
   ->  Hash  (cost=44.01..44.01 rows=3001 width=4)
         ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)

demo2=# select * from t_small, t_big where a = b;
Time: 18751.588 ms (00:18.752)

RCA:

The cost attributed to the inner table mainly comes from building the hash
table:

    startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
        * inner_path_rows;
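
Plugging in the default cost parameters (cpu_operator_cost = 0.0025,
cpu_tuple_cost = 0.01) and a single hash clause, this build term works out
roughly as follows (a sketch only; the real startup cost also adds the inner
scan cost and batching I/O):

-- Estimated hash-build term for each candidate inner table:
select (0.0025 * 1 + 0.01) * 100000032 as build_term_big_inner,   -- ~1250000
       (0.0025 * 1 + 0.01) * 3001      as build_term_small_inner; -- ~37.5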

The cost attributed to the outer table mainly comes from probing the hash
table. Computing the hash value:

    run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

Traversing the linked list in the bucket and comparing tuples:

    run_cost += hash_qual_cost.per_tuple * outer_path_rows *
        clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
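
With the same defaults, and approximating hash_qual_cost.per_tuple by
cpu_operator_cost and innerbucketsize by 1/ndistinct (the planner's real
estimate comes from estimate_hash_bucketsize() and is more involved, so these
numbers will not match the EXPLAIN output exactly), the probe terms come out
roughly as:

-- Small table as inner: ~1e8 probes, chain ≈ 3001/100 ≈ 30 entries each.
select 0.0025 * 100000032            as hash_calc_small_inner,  -- ~250000
       0.0025 * 100000032 * 30 * 0.5 as chain_walk_small_inner; -- ~3750001

-- Big table as inner: only ~3001 probes, chain ≈ 100000032/100000 ≈ 1000.
select 0.0025 * 3001                 as hash_calc_big_inner,    -- ~7.5
       0.0025 * 3001 * 1000 * 0.5    as chain_walk_big_inner;   -- ~3751

Under these rough estimates the planner sees about 37.5 + 4,000,000 for
small-as-inner versus about 1,250,000 + 3,800 for big-as-inner, which is why
it flips the join order.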

In general, the cost of building a hash table is higher than the cost of
probing it, so we tend to use the small table as the inner table. But if the
estimated average chain length of the buckets is large, the situation is just
the opposite.

In the test case above, the small table has 3001 tuples and 100 distinct
values in column 'a'. If we use the small table as the inner table, the chain
length of each populated bucket is about 30, and the planner charges every
probe of the hash table for walking half of such a chain. So the estimated
cost of probing becomes bigger than the cost of building the hash table, and
the planner chooses the big table as the inner side.

But in fact this is not true. We initialized 620,000 buckets in the hash
table, but only 100 of them hold chains of length 30; all the other buckets
are empty. A probe that does not hit one of those 100 buckets only compares
hash values, which costs very little. The outer table has 100,000 distinct
keys and 100,000,000 tuples, so only (100/100000) * tuple_num of the outer
tuples search a whole chain; the other (99900/100000) * tuple_num outer
tuples just compare hash values. So the actual cost is much smaller than what
the planner calculated. This is the reason why using the small table as the
inner table is faster.
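
A back-of-the-envelope check of that claim (illustrative arithmetic only,
using the numbers above):

-- Average chain comparisons per outer probe:
select (100.0 / 100000) * 30 as rough_actual_per_probe,   -- = 0.03
       30 * 0.5              as planner_charge_per_probe; -- = 15

The planner charges every probe for half of a 30-entry chain, about 500 times
more chain work than the skewed data actually incurs.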
