Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Yuto Hayamizu <y(dot)hayamizu(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
Date: 2018-07-27 08:49:27
Message-ID: CAFjFpRcex4Q4uSGtPMRkpxidg1QfSj7fChmGLyBLjj9A329uDw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 26, 2018 at 10:30 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> On 26 July 2018 at 07:12, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> In the patch clauselist_selectivity() gets called repeatedly for every
>> new qual added to the clause list. Instead, if we try to combine the
>> cost/row estimation with order_qual_clauses() or
>> clauselist_selectivity(), we might be able to do what the current patch
>> does in O(n). clauselist_selectivity() calculates combined selectivity
>> of all the given quals in O(n). We should do something similar to that
>> in this case.
>
> Duplicating the logic of clauselist_selectivity() seems like quite a
> lot of effort to go to just for improved filter cost estimates. Note
> also that clauselist_selectivity() might get a good deal more complex
> with multivariate stats.

I am not suggesting that we duplicate the code in
clauselist_selectivity(). Instead I am suggesting that we incorporate
costing into that function. I don't know how feasible that is.

>
> Perhaps a reasonable simplification would be to just treat the clauses
> as independent when estimating the filter cost, and so use the
> following as a "good enough" estimate for the filter cost:
>
> cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...
>
> That would probably be an improvement over what we currently have. It
> would be O(n) to compute, and it would probably use the cached
> selectivity estimates for the clauses.

That looks like a good trade-off. But looking at the following comment
in clause_selectivity(), I doubt that this will work in all cases:
/*
* If possible, cache the result of the selectivity calculation for
* the clause. We can cache if varRelid is zero or the clause
* contains only vars of that relid --- otherwise varRelid will affect
* the result, so mustn't cache. Outer join quals might be examined
* with either their join's actual jointype or JOIN_INNER, so we need
* two cache variables to remember both cases. Note: we assume the
* result won't change if we are switching the input relations or
* considering a unique-ified case, so we only need one cache variable
* for all non-JOIN_INNER cases.
*/

>
> Note also that with this simplified expression for the filter cost, it
> would be possible to improve order_qual_clauses() to take into account
> the clause selectivities when sorting the clauses, and minimise the
> cost of the filter by evaluating more selective clauses sooner, if
> they're not too expensive. I believe that amounts to sorting by
> cost/(1-sel) rather than just cost, except for clauses with sel==1,
> which it makes sense to move to the end, and just sort by cost.

+1, if we could do that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
