Re: PATCH: add support for IN and @> in functional-dependency statistics use

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: add support for IN and @> in functional-dependency statistics use
Date: 2020-03-13 16:09:44
Message-ID: 20200313160944.aqrbkyf7tkasoxfk@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote:
>On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> I'm sorry, but I don't see how we could do this for arbitrary clauses. I
>> think we could do that for clauses that have equality semantics and
>> reference column values as a whole. So I think it's possible to do this
>> for IN clauses (which is what the first part of the patch does), but I
>> don't think we can do it for the containment operator.
>>
>> I.e. we can do that for
>>
>> WHERE a IN (...) AND b IN (...)
>>
>
>Hmm, the difficulty always comes back to the compatibility of the
>clauses though. It's easy to come up with artificial examples for
>which functional dependencies come up with bad estimates, even with
>just = and IN (...) operators. For example, given a perfect
>correlation like
>
> a | b
> -------
> 1 | 1
> 2 | 2
> 3 | 3
> : | :
>
>you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
>IN (2,4,6,8,...)" to get a very bad estimate from functional
>dependencies.
>
>However, I don't think such artificial examples are that useful. I
>think you have to think in terms of real data distributions together
>with real queries expected to go with them. For example:
>
>Using the OP's original example of a multi-tenant system, you might
>well have a table with columns (product_type, tenant_id) and a
>functional dependency product_type => tenant_id. In that case, it
>could well be very useful in optimising queries like "WHERE
>product_type IN (X,Y,Z) AND tenant_id = 123".
>
>But this isn't necessarily limited to = and IN (...). For example,
>consider a table with UK-based geographic data with columns (location
>point, postcode text). Then there would be a strong functional
>dependency location => postcode (and possibly also the other way
>round, depending on how dense the points were). That dependency could
>be used to estimate much more general queries like "WHERE location <@
>some_area AND postcode ~ '^CB.*'", where there may be no useful stats
>on location, but a histogram on postcode might give a reasonable
>estimate.
>
>This also extends to inequalities. For example a table with columns
>(weight, category) might have a strong functional dependency weight =>
>category. Then a query like "WHERE weight > 10 AND weight < 20 AND
>category = 'large'" could get a decent estimate from a histogram on
>the weight column, and then use the functional dependency to note that
>that implies the category. Note that such an example would work with
>my patch from the other thread, because it groups clauses by column,
>and uses clauselist_selectivity_simple() on them. So in this case, the
>clauses "weight > 10 AND weight < 20" would be estimated together, and
>would be able to make use of the RangeQueryClause code.
>
>Of course, it's equally easy to come up with counter-example queries
>for any of those examples, where using the functional dependency would
>produce a poor estimate. Ultimately, it's up to the user to decide
>whether or not to build functional dependency statistics, and that
>decision needs to be based not just on the data distribution, but also
>on the types of queries expected.
>

Well, yeah. I'm sure we can produce countless examples where applying
the functional dependencies to additional types of clauses helps a lot.
I'm somewhat hesitant to just drop any restrictions, though, because
it's equally simple to produce examples with poor results.

The main issue I have with just applying dependencies to arbitrary
clauses is it uses "degree" computed for the value as a whole, and
just uses it to estimate dependency between pieces of the values.

The IN() clause does not have this problem, the other cases like @> or
pattern matching do.

>Given the timing though, perhaps it is best to limit this to IN (..)
>clauses for PG13, and we can consider other possibilities later.
>

Yeah, I was gonna propose the same thing. I'll get the IN bit committed
shortly (both for dependencies and MCV), improved handling of OR clauses
and some additional regression tests to increase the coverage.

Then we can discuss these improvement in more detail for PG14.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-13 16:14:52 Re: [PATCH] Use PKG_CHECK_MODULES to detect the libxml2 library
Previous Message Tom Lane 2020-03-13 16:06:24 Re: [PATCH] Use PKG_CHECK_MODULES to detect the libxml2 library