Re: multivariate statistics / patch v7

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-30 16:58:39
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/30/2015 03:55 PM, Tomas Vondra wrote:
> On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
>> I have some doubts about the clause reduction and functional
>> dependencies part of this. It seems to treat functional dependency as
>> a boolean property, but even with the classic zipcode and city case,
>> it's not always an all or nothing thing. At least in some countries,
>> there can be zipcodes that span multiple cities. So zipcode=X does
>> not completely imply city=Y, although there is a strong correlation
>> (if that's the right term). How strong does the correlation need to
>> be for this patch to decide that zipcode implies city? I couldn't
>> actually see a clear threshold stated anywhere.
>> So rather than treating functional dependence as a boolean, I think
>> it would make more sense to put a 0.0-1.0 number to it. That means
>> that you can't do clause reduction like it's done in this patch,
>> where you actually remove clauses from the query for cost esimation
>> purposes. Instead, you need to calculate the selectivity for each
>> clause independently, but instead of just multiplying the
>> selectivities together, apply the "dependence factor" to it.
>> Does that make sense? I haven't really looked at the MCV, histogram
>> and "multi-statistics estimation" patches yet. Do those patches make
>> the clause reduction patch obsolete? Should we forget about the
>> clause reduction and functional dependency patch, and focus on those
>> later patches instead?
> Perhaps. It's true that most real-world data sets are not 100% valid
> with respect to functional dependencies - either because of natural
> imperfections (multiple cities with the same ZIP code) or just noise in
> the data (incorrect entries ...). And it's even mentioned in the code
> comments somewhere, I guess.
> But there are two main reasons why I chose not to extend the functional
> dependencies with the [0.0-1.0] value you propose.
> Firstly, functional dependencies were meant to be the simplest possible
> implementation, illustrating how the "infrastructure" is supposed to
> work (which is the main topic of the first patch).
> Secondly, all kinds of statistics are "simplifications" of the actual
> data. So I think it's not incorrect to ignore the exceptions up to some
> threshold.

The problem with a threshold is that around that threshold, even a small
change in the data set can drastically change the produced estimates.
For example, imagine that we know from the stats that zip code implies
city. But then someone adds a single row to the table with an odd zip
code & city combination, which pushes the estimator over the threshold,
and the columns are no longer considered dependent, and the estimates
are now completely different. We should avoid steep cliffs like that.

BTW, what is the threshold in the current patch?

- Heikki

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2015-07-30 17:14:38 Re: dblink: add polymorphic functions.
Previous Message Tom Lane 2015-07-30 16:51:07 Re: dblink: add polymorphic functions.