Quick Links

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From:	Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To:	Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date:	2023-09-11 04:51:03
Message-ID:	CAKU4AWoKiogNir=ESrm6F9pUwYFDfmsoNG9w9k1MG3VUbKM0Dw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
wrote:

> Hi, all.
>
> Some of my clients use JOIN's with three - four clauses. Quite
> frequently, I see complaints on unreasonable switch of JOIN algorithm to
> Merge Join instead of Hash Join. Quick research have shown one weak
> place - estimation of an average bucket size in final_cost_hashjoin (see
> q2.sql in attachment) with very conservative strategy.
> Unlike estimation of groups, here we use smallest ndistinct value across
> all buckets instead of multiplying them (or trying to make multivariate
> analysis).
> It works fine for the case of one clause. But if we have many clauses,
> and if each has high value of ndistinct, we will overestimate average
> size of a bucket and, as a result, prefer to use Merge Join. As the
> example in attachment shows, it leads to worse plan than possible,
> sometimes drastically worse.
> I assume, this is done with fear of functional dependencies between hash
> clause components. But as for me, here we should go the same way, as
> estimation of groups.
>

I can reproduce the visitation you want to improve and verify the patch
can do it expectedly. I think this is a right thing to do.

> The attached patch shows a sketch of the solution.
>

I understand that this is a sketch of the solution, but the below changes
still
make me confused.

+ if (innerbucketsize > virtualbuckets)
+ innerbucketsize = 1.0 / virtualbuckets;

innerbucketsize is a fraction of rows in all the rows, so it is between 0.0
and 1.0.
and virtualbuckets is the number of buckets in total (when considered the
mutli
batchs), how is it possible for 'innerbucketsize > virtualbuckets' ? Am
I missing something?

--
Best Regards
Andy Fan

In response to

MergeJoin beats HashJoin in the case of multiple hash clauses at 2023-06-15 08:30:10 from Andrey Lepikhov

Responses

Re: MergeJoin beats HashJoin in the case of multiple hash clauses at 2023-09-11 08:04:22 from Lepikhov Andrei

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Pavel Stehule	2023-09-11 04:57:12	Re: proposal: possibility to read dumped table's name from file
Previous Message	Pavel Stehule	2023-09-11 04:34:56	Re: proposal: possibility to read dumped table's name from file