Re: PATCH: add support for IN and @> in functional-dependency statistics use

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: PATCH: add support for IN and @> in functional-dependency statistics use
Date: 2020-02-02 18:41:34
Message-ID: 20200202184134.swoqkqlqorqolrqv@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Feb 02, 2020 at 10:59:32AM +0100, Pierre Ducroquet wrote:
>On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote:
>> On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:
>> >Hello
>> >
>> >At my current job, we have a lot of multi-tenant databases, thus with
>> >tables containing a tenant_id column. Such a column introduces a severe
>> >bias in statistics estimation since any other FK in the next columns is
>> >very likely to have a functional dependency on the tenant id. We found
>> >several queries where this functional dependency messed up the estimations
>> >so much the optimizer chose wrong plans.
>> >When we tried to use extended statistics with CREATE STATISTIC on
>> >tenant_id, other_id, we noticed that the current implementation for
>> >detecting functional dependency lacks two features (at least in our use
>> >case):
>> >- support for IN clauses
>> >- support for the array contains operator (that could be considered as a
>> >special case of IN)
>>
>> Thanks for the patch. I don't think the lack of support for these clause
>> types is an oversight - we haven't done them because we were not quite
>> sure the functional dependencies can actually apply to them. But maybe
>> we can support them, I'm not against that in principle.
>>
>> >After digging in the source code, I think the lack of support for IN
>> >clauses is an oversight and due to the fact that IN clauses are
>> >ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by
>> >simply copying the code- path for OpExpr and changing the type name. It
>> >compiles and the results are correct, with a dependency being correctly
>> >taken into consideration when estimating rows. If you think such a copy
>> >paste is bad and should be factored in another static bool function,
>> >please say so and I will happily provide an updated patch.
>>
>> Hmmm. Consider a query like this:
>>
>> ... WHERE tenant_id = 1 AND another_column IN (2,3)
>>
>> which kinda contradicts the idea of functional dependencies that knowing
>> a value in one column, tells us something about a value in a second
>> column. But that assumes a single value, which is not quite true here.
>>
>> The above query is essentially the same thing as
>>
>> ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))
>>
>> and also
>>
>> ... WHERE (tenant_id=1 AND another_column=2)
>> OR (tenant_id=1 AND another_column=3)
>>
>> at wchich point we could apply functional dependencies - but we'd do it
>> once for each AND-clause, and then combine the results to compute
>> selectivity for the OR clause.
>>
>> But this means that if we apply functional dependencies directly to the
>> original clause, it'll be inconsistent. Which seems a bit unfortunate.
>>
>> Or do I get this wrong?
>
>In my tests, I've got a table with two columns a and b, generated this way:
> CREATE TABLE test (a INT, b INT)
> AS SELECT i, i/10 FROM
> generate_series(1, 100000) s(i),
> generate_series(1, 5) x;
>
>With statistics defined on the a, b columns
>
>Here are the estimated selectivity results without any patch:
>
>SELECT * FROM test WHERE a = 1 AND b = 1 : 5
>SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1
>SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1
>SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1
>
>With the patch, the estimated rows of the last query goes back to 5, which is
>more logical. The other ones don't change.
>

Yes, I think you're right. I've been playing with this a bit more, and I
think you're right we can support it the way you propose.

I'm still a bit annoyed by the inconsistency this might/does introduce.
Consider for example these clauses:

a) ... WHERE a = 10 AND b IN (100, 200)
b) ... WHERE a = 10 AND (b = 100 OR b = 200)
c) ... WHERE (a = 10 AND b = 100) OR (a = 10 AND b = 200)

All three cases are logically equivalent and do return the same set of
rows. But we estimate them differently, arriving at different estimates.

Case (a) is the one you improve in your patch. Case (c) is actually not
possible in practice, because we rewrite it as (b) during planning.

But (b) is estimated very differently, because we don't recognize the OR
clause as supported by functional dependencies. On the one hand I'm sure
it's not the first case where we already estimate equivalent clauses
differently. On the other hand I wonder how difficult would it be to
support this specific type of OR clause (with all expressions having the
form "Var = Const" and all Vars referencing the same rel).

I'm not going to block the patch because of this, of course. Similarly,
it'd be nice to add support for ScalarArrayOpExpr to MCV stats, not just
functional dependencies ...

>> BTW the code added in the 0001 patch is the same as for is_opclause, so
>> maybe we can simply do
>>
>> if (is_opclause(rinfo->clause) ||
>> IsA(rinfo->clause, ScalarArrayOpExpr))
>> {
>> ...
>> }
>>
>> instead of just duplicating the code.
>
>I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary
>compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is
>at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work
>around with this kind of code, but I don't like it much:
>List *args;
>Oid opno;
>if (IsA(rinfo->clause, OpExpr))
>{
> /* If it's an opclause, we will check for Var = Const or Const = Var. */
> OpExpr *expr = (OpExpr *) rinfo->clause;
> args = expr->args;
> opno = expr->opno;
>}
>else if (IsA(rinfo->clause, ScalarArrayOpExpr))
>{
> /* If it's a ScalarArrayOpExpr, check for Var IN Const. */
> ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause;
> args = expr->args;
> opno = expr->opno;
>}
>

Oh, right. I'm dumb, I missed this obvious detail. I blame belgian beer.

>Or I can rewrite it in C++ to play with templates... :)
>

Please don't ;-)

>> We also need some at least some
>> regression tests, testing functional dependencies with this clause type.
>
>Agreed
>
>> >The lack of support for @> operator, on the other hand, seems to be a
>> >decision taken when writing the initial code, but I can not find any
>> >mathematical nor technical reason for it. The current code restricts
>> >dependency calculation to the = operator, obviously because inequality
>> >operators are not going to work... but array contains is just several =
>> >operators grouped, thus the same for the dependency calculation. The
>> >second patch refactors the operator check in order to also include array
>> >contains.
>>
>> No concrete opinion on this yet. I think my concerns are pretty similar
>> to the IN clause, although I'm not sure what you mean by "this could be
>> considered as special case of IN".
>
>I meant from a mathematical point of view.
>

Can you elaborate a bit? I still don't understand how it's just "several
equality operators grouped".

I think the challenge here is in applying the functional dependency
computed for the whole array to individual elements. I'm not sure we can
do that.

For example, with a table like this:

CREATE TABLE t (a int, b int[]);
CREATE STATISTICS s (dependencies) ON a, b FROM t;

Let's say the functional dependency is "perfect" i.e. has strength 1.0.
But that only tells us dependency for complete array values, we don't
know how much information we gain by knowledge of subset of the values.

For example, all the arrays may contain {1, 2, 3} as subset, and then
some "unique" element, like this:

INSERT INTO t SELECT i/1000, ARRAY[1,2,3, 4 + i/100]
FROM generate_series(1,1000000) s(i);

and then do a query like this:

select * from t where a = 10 and b @> ARRAY[1,2];

Without extended stats, it's estimated like this:

QUERY PLAN
---------------------------------------------------------------
Seq Scan on t (cost=0.00..24346.00 rows=997 width=41)
(actual time=1.391..140.261 rows=1000 loops=1)
Filter: ((b @> '{1,2}'::integer[]) AND (a = 10))
Rows Removed by Filter: 999000
Planning Time: 0.052 ms
Execution Time: 140.707 ms
(5 rows)

but the moment you create functional stats, you get this:

QUERY PLAN
---------------------------------------------------------------
Seq Scan on t (cost=0.00..24346.00 rows=1000000 width=41)
(actual time=1.432..143.047 rows=1000 loops=1)
Filter: ((b @> '{1,2}'::integer[]) AND (a = 10))
Rows Removed by Filter: 999000
Planning Time: 0.099 ms
Execution Time: 143.527 ms
(5 rows)

That doesn't seem very useful :-(

>> >I tested the patches on current HEAD, but I can test and provide
>> >back-ported versions of the patch for other versions if needed (this code
>> >path hardly changed since its introduction in 10).
>>
>> I think the chance of this getting backpatched is zero, because it might
>> easily break existing apps.
>
>I understand
>

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ian Barwick 2020-02-03 00:14:23 Re: Prevent pg_basebackup running as root
Previous Message Adé 2020-02-02 18:06:01 [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables