Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

From: Jinbao Chen <jinchen(at)pivotal(dot)io>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.
Date: 2019-11-28 11:18:57
Message-ID: CACLUs_g4xHt3OijSAZWnyqnbsyYnCGRc-Zy6=M-uAD7J7+YJvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andy,

I just test the query on 12.1. But pg use big_table as inner.

demo=# explain (costs off) select * from t_small, t_big where a = b;
QUERY PLAN
------------------------------------
Hash Join
Hash Cond: (t_small.a = t_big.b)
-> Seq Scan on t_small
-> Hash
-> Seq Scan on t_big

Do you insert data and set max_parallel_workers_per_gather to 0 like above?

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000)i;
insert into t_big select i%100000 from generate_series(1, 100000000)i ;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;

On Thu, Nov 28, 2019 at 5:46 PM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:

>
>
> On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen <jinchen(at)pivotal(dot)io> wrote:
>
>> Hi hackers,
>>
>> I have made a patch to fix the problem.
>>
>> Added the selection rate of the inner table non-empty bucket
>>
>> The planner will use big table as inner table in hash join
>> if small table have fewer unique values. But this plan is
>> much slower than using small table as inner table.
>>
>> In general, the cost of creating a hash table is higher
>> than the cost of querying a hash table. So we tend to use
>> small tables as internal tables. But if the average chain
>> length of the bucket is large, the situation is just the
>> opposite.
>>
>> If virtualbuckets is much larger than innerndistinct, and
>> outerndistinct is much larger than innerndistinct. Then most
>> tuples of the outer table will match the empty bucket. So when
>> we calculate the cost of traversing the bucket, we need to
>> ignore the tuple matching empty bucket.
>>
>> So we add the selection rate of the inner table non-empty
>> bucket. The formula is:
>> (1 - ((outerndistinct - innerndistinct)/outerndistinct)*
>> ((virtualbuckets - innerndistinct)/virtualbuckets))
>>
>>
>> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen <jinchen(at)pivotal(dot)io> wrote:
>>
>>> I think we have the same understanding of this issue.
>>>
>>> Sometimes use smaller costs on scanning the chain in bucket like below
>>> would
>>> be better.
>>> run_cost += outer_path_rows * some_small_probe_cost;
>>> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>> In some version of GreenPlum(a database based on postgres), we just
>>> disabled
>>> the cost on scanning the bucket chain. In most cases, this can get a
>>> better query
>>> plan. But I am worried that it will be worse in some cases.
>>>
>>> Now only the small table's distinct value is much smaller than the
>>> bucket number,
>>> and much smaller than the distinct value of the large table, the planner
>>> will get the
>>> wrong plan.
>>>
>>> For example, if inner table has 100 distinct values, and 3000 rows. Hash
>>> table
>>> has 1000 buckets. Outer table has 10000 distinct values.
>>> We can assume that all the 100 distinct values of the inner table are
>>> included in the
>>> 10000 distinct values of the outer table. So (100/10000)*outer_rows
>>> tuples will
>>> probe the buckets has chain. And (9900/10000)*outer_rows tuples will
>>> probe
>>> all the 1000 buckets randomly. So (9900/10000)*outer_rows*(900/1000)
>>> tuples will
>>> probe empty buckets. So the costs on scanning bucket chain is
>>>
>>> hash_qual_cost.per_tuple*innerbucketsize*outer_rows*
>>> (1 - ((outer_distinct - inner_distinct)/outer_distinct)*((buckets_num -
>>> inner_disttinct)/buckets_num))
>>>
>>> Do you think this assumption is reasonable?
>>>
>>>
>>> On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
>>> wrote:
>>>
>>>> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen <jinchen(at)pivotal(dot)io> wrote:
>>>> > In the test case above, the small table has 3000 tuples and 100
>>>> distinct values on column ‘a’.
>>>> > If we use small table as inner table. The chan length of the bucket
>>>> is 30. And we need to
>>>> > search the whole chain on probing the hash table. So the cost of
>>>> probing is bigger than build
>>>> > hash table, and we need to use big table as inner.
>>>> >
>>>> > But in fact this is not true. We initialized 620,000 buckets in
>>>> hashtable. But only 100 buckets
>>>> > has chains with length 30. Other buckets are empty. Only hash values
>>>> need to be compared.
>>>> > Its costs are very small. We have 100,000 distinct key and
>>>> 100,000,000 tuple on outer table.
>>>> > Only (100/100000)* tuple_num tuples will search the whole chain. The
>>>> other tuples
>>>> > (number = (98900/100000)*tuple_num*) in outer
>>>> > table just compare with the hash value. So the actual cost is much
>>>> smaller than the planner
>>>> > calculated. This is the reason why using a small table as inner is
>>>> faster.
>>>>
>>>> So basically we think that if t_big is on the outer side, we'll do
>>>> 100,000,000 probes and each one is going to scan a t_small bucket with
>>>> chain length 30, so that looks really expensive. Actually only a
>>>> small percentage of its probes find tuples with the right hash value,
>>>> but final_cost_hash_join() doesn't know that. So we hash t_big
>>>> instead, which we estimated pretty well and it finishes up with
>>>> buckets of length 1,000 (which is actually fine in this case, they're
>>>> not unwanted hash collisions, they're duplicate keys that we need to
>>>> emit) and we probe them 3,000 times (which is also fine in this case),
>>>> but we had to do a bunch of memory allocation and/or batch file IO and
>>>> that turns out to be slower.
>>>>
>>>> I am not at all sure about this but I wonder if it would be better to
>>>> use something like:
>>>>
>>>> run_cost += outer_path_rows * some_small_probe_cost;
>>>> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>>
>>>> If we can estimate how many tuples will actually match accurately,
>>>> that should also be the number of times we have to run the quals,
>>>> since we don't usually expect hash collisions (bucket collisions, yes,
>>>> but hash collisions where the key doesn't turn out to be equal, no*).
>>>>
>>>> * ... but also yes as you approach various limits, so you could also
>>>> factor in bucket chain length that is due to being prevented from
>>>> expanding the number of buckets by arbitrary constraints, and perhaps
>>>> also birthday_problem(hash size, key space) to factor in unwanted hash
>>>> collisions that start to matter once you get to billions of keys and
>>>> expect collisions with short hashes.
>>>>
>>>
> FYI: I tried this on 12.1, and find it use small_table as inner table
> already. I didn't look into the details so far.
>
> postgres=# explain (costs off) select * from join_hash_t_small,
> join_hash_t_big where a = b;
> QUERY PLAN
> --------------------------------------------------------
> Hash Join
> Hash Cond: (join_hash_t_big.b = join_hash_t_small.a)
> -> Seq Scan on join_hash_t_big
> -> Hash
> -> Seq Scan on join_hash_t_small
> (5 rows)
>
> postgres=# select version();
> version
>
> -----------------------------------------------------------------------------------------------------------------
> PostgreSQL 12.1 on x86_64-apple-darwin18.7.0, compiled by Apple LLVM
> version 10.0.1 (clang-1001.0.46.4), 64-bit
> (1 row)
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2019-11-28 11:56:20 Re: [HACKERS] WAL logging problem in 9.4.3?
Previous Message Pengzhou Tang 2019-11-28 11:07:22 Re: Parallel grouping sets