Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

From: Jinbao Chen <jinchen(at)pivotal(dot)io>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.
Date: 2019-11-19 09:56:06
Message-ID: CACLUs_gxrFegadpOC8TuZxRb6+bJsNeTr3thWcKkq0Onk4JcQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I think we have the same understanding of this issue.

Sometimes use smaller costs on scanning the chain in bucket like below would
be better.
run_cost += outer_path_rows * some_small_probe_cost;
run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
In some version of GreenPlum(a database based on postgres), we just disabled
the cost on scanning the bucket chain. In most cases, this can get a better
query
plan. But I am worried that it will be worse in some cases.

Now only the small table's distinct value is much smaller than the bucket
number,
and much smaller than the distinct value of the large table, the planner
will get the
wrong plan.

For example, if inner table has 100 distinct values, and 3000 rows. Hash
table
has 1000 buckets. Outer table has 10000 distinct values.
We can assume that all the 100 distinct values of the inner table are
included in the
10000 distinct values of the outer table. So (100/10000)*outer_rows tuples
will
probe the buckets has chain. And (9900/10000)*outer_rows tuples will probe
all the 1000 buckets randomly. So (9900/10000)*outer_rows*(900/1000) tuples
will
probe empty buckets. So the costs on scanning bucket chain is

hash_qual_cost.per_tuple*innerbucketsize*outer_rows*
(1 - ((outer_distinct - inner_distinct)/outer_distinct)*((buckets_num -
inner_disttinct)/buckets_num))

Do you think this assumption is reasonable?

On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen <jinchen(at)pivotal(dot)io> wrote:
> > In the test case above, the small table has 3000 tuples and 100 distinct
> values on column ‘a’.
> > If we use small table as inner table. The chan length of the bucket is
> 30. And we need to
> > search the whole chain on probing the hash table. So the cost of probing
> is bigger than build
> > hash table, and we need to use big table as inner.
> >
> > But in fact this is not true. We initialized 620,000 buckets in
> hashtable. But only 100 buckets
> > has chains with length 30. Other buckets are empty. Only hash values
> need to be compared.
> > Its costs are very small. We have 100,000 distinct key and 100,000,000
> tuple on outer table.
> > Only (100/100000)* tuple_num tuples will search the whole chain. The
> other tuples
> > (number = (98900/100000)*tuple_num*) in outer
> > table just compare with the hash value. So the actual cost is much
> smaller than the planner
> > calculated. This is the reason why using a small table as inner is
> faster.
>
> So basically we think that if t_big is on the outer side, we'll do
> 100,000,000 probes and each one is going to scan a t_small bucket with
> chain length 30, so that looks really expensive. Actually only a
> small percentage of its probes find tuples with the right hash value,
> but final_cost_hash_join() doesn't know that. So we hash t_big
> instead, which we estimated pretty well and it finishes up with
> buckets of length 1,000 (which is actually fine in this case, they're
> not unwanted hash collisions, they're duplicate keys that we need to
> emit) and we probe them 3,000 times (which is also fine in this case),
> but we had to do a bunch of memory allocation and/or batch file IO and
> that turns out to be slower.
>
> I am not at all sure about this but I wonder if it would be better to
> use something like:
>
> run_cost += outer_path_rows * some_small_probe_cost;
> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>
> If we can estimate how many tuples will actually match accurately,
> that should also be the number of times we have to run the quals,
> since we don't usually expect hash collisions (bucket collisions, yes,
> but hash collisions where the key doesn't turn out to be equal, no*).
>
> * ... but also yes as you approach various limits, so you could also
> factor in bucket chain length that is due to being prevented from
> expanding the number of buckets by arbitrary constraints, and perhaps
> also birthday_problem(hash size, key space) to factor in unwanted hash
> collisions that start to matter once you get to billions of keys and
> expect collisions with short hashes.
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rushabh Lathia 2019-11-19 10:00:17 Re: backup manifests
Previous Message Pavel Stehule 2019-11-19 09:16:47 Re: range_agg