Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-30 21:28:50
Message-ID: 55BA9712.9090608@2ndquadrant.com

Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:
>
> The problem with a threshold is that around that threshold, even a
> small change in the data set can drastically change the produced
> estimates. For example, imagine that we know from the stats that zip
> code implies city. But then someone adds a single row to the table
> with an odd zip code & city combination, which pushes the estimator
> over the threshold, and the columns are no longer considered
> dependent, and the estimates are now completely different. We should
> avoid steep cliffs like that.
>
> BTW, what is the threshold in the current patch?

There's no simple threshold - the algorithm mining the functional
dependencies is a bit more complicated than that. I tried to explain it
in the comment before build_mv_dependencies() (in dependencies.c), but
let me briefly summarize it here.

To mine the dependency [A => B], build_mv_dependencies() does this:

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency

(a) if the group is too small (fewer than 3 rows), ignore it

(b) if the group is consistent, update

n_supporting
n_supporting_rows

(c) if the group is inconsistent, update

n_contradicting
n_contradicting_rows

(4) decide whether the dependency is "valid" by checking

n_supporting_rows >= n_contradicting_rows * 10

The 10x limit is rather arbitrary, and yes - I can imagine a more
complex condition (e.g. one looking at the average number of tuples per
group), but I haven't looked into that - the point was to use something
very simple, merely to illustrate the infrastructure.
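
To make that description concrete, here's a minimal standalone sketch
of the loop - not the actual build_mv_dependencies() code, the names
and the sample data are made up for illustration:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int a; int b; } SampleRow;

#define MIN_GROUP_SIZE 3

/* (1) sort the sample by {A,B} */
static int
cmp_rows(const void *p1, const void *p2)
{
    const SampleRow *r1 = p1;
    const SampleRow *r2 = p2;

    if (r1->a != r2->a)
        return (r1->a < r2->a) ? -1 : 1;
    if (r1->b != r2->b)
        return (r1->b < r2->b) ? -1 : 1;
    return 0;
}

/* returns 1 if the sample supports the dependency [A => B] */
static int
dependency_is_valid(SampleRow *rows, int nrows)
{
    int n_supporting_rows = 0;      /* tracking only row counts here */
    int n_contradicting_rows = 0;
    int i = 0;

    qsort(rows, nrows, sizeof(SampleRow), cmp_rows);

    while (i < nrows)
    {
        /* (2) collect the group of rows with the same value of A */
        int start = i;
        int consistent = 1;

        while (i < nrows && rows[i].a == rows[start].a)
        {
            /* consistent only if all rows in the group share one B */
            if (rows[i].b != rows[start].b)
                consistent = 0;
            i++;
        }

        /* (3a) ignore groups that are too small */
        if (i - start < MIN_GROUP_SIZE)
            continue;

        /* (3b) and (3c) update the counters */
        if (consistent)
            n_supporting_rows += (i - start);
        else
            n_contradicting_rows += (i - start);
    }

    /* (4) the (rather arbitrary) 10x validity condition */
    return (n_supporting_rows >= n_contradicting_rows * 10);
}

int
main(void)
{
    /* zip (a) implies city (b), except for a single odd row */
    SampleRow rows[] = {
        {10001, 1}, {10001, 1}, {10001, 1},
        {48201, 2}, {48201, 2}, {48201, 2}, {48201, 1},
    };

    printf("valid: %d\n",
           dependency_is_valid(rows, sizeof(rows) / sizeof(rows[0])));
    return 0;
}

Note how the single odd row flips the whole second group to
"inconsistent" and invalidates the dependency - which is exactly the
cliff you're describing.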

I think we might come up with some elaborate way of associating a
"degree" with the functional dependency, but at that point we really
lose the simplicity, and we also make it indistinguishable from the
remaining statistics types (because it would no longer be possible to
reduce the clauses like this before performing the regular estimation).
That reduction is exactly what makes functional dependencies so neat
and efficient, so I'm not overly enthusiastic about doing that.
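
Just to illustrate what I mean by "degree" - the patch doesn't define
one, so this is purely hypothetical - it might look like a smooth ratio
instead of the hard 10x cutoff:

#include <stdio.h>

/* hypothetical - a smooth degree instead of the 10x condition */
static double
dependency_degree(int n_supporting_rows, int n_contradicting_rows)
{
    int total = n_supporting_rows + n_contradicting_rows;

    /* no usable groups -> no evidence of a dependency */
    if (total == 0)
        return 0.0;

    /* 1.0 = perfect dependency, 0.0 = fully contradicted */
    return (double) n_supporting_rows / total;
}

int
main(void)
{
    printf("degree: %f\n", dependency_degree(97, 3));
    return 0;
}

The estimator would then have to blend the reduced and non-reduced
estimates using that degree, which is precisely where the simplicity
gets lost.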

What seems more interesting is implementing the ndistinct coefficient
instead, as proposed by Kyotaro-san - that seems to have the nice
"smooth" behavior you desire, while keeping the simplicity.

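Kyotaro-san's exact definition isn't restated here, so take this as
just one plausible shape of the idea - for equality clauses on both
columns, use the number of distinct {A,B} combinations instead of the
product of the per-column ndistinct estimates:

#include <stdio.h>

int
main(void)
{
    double ndistinct_a  = 1000.0;   /* distinct zip codes */
    double ndistinct_b  = 100.0;    /* distinct cities */
    double ndistinct_ab = 1005.0;   /* distinct (zip, city) pairs */

    /* independence assumption: 1/ndistinct(A) * 1/ndistinct(B) */
    double sel_indep = (1.0 / ndistinct_a) * (1.0 / ndistinct_b);

    /* ndistinct-based estimate: 1/ndistinct(A,B) */
    double sel_ndistinct = 1.0 / ndistinct_ab;

    printf("independence: %.8f ndistinct: %.8f\n",
           sel_indep, sel_ndistinct);
    return 0;
}

A few extra odd (zip, city) pairs only nudge ndistinct_ab a little, so
the estimate degrades gradually - no cliff.
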
Both statistics types (functional dependencies and the ndistinct
coefficient) have one weak point, though - they implicitly assume the
queries use "compatible" values. For example, with a query like

WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect that the values are incompatible, because these
statistics types are oblivious to individual values - the clauses get
reduced and the selectivity is estimated as if the values matched, even
though no row can satisfy both conditions. I don't see this as a fatal
flaw, though - it's rather a consequence of the nature of these stats,
and I tend to look at functional dependencies the same way.
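
A tiny made-up illustration of that failure mode - with a valid
dependency [zip => city] the city clause simply gets dropped, so the
estimate is the same whether or not the values actually co-occur:

#include <stdio.h>

int
main(void)
{
    double ndistinct_zip = 1000.0;

    /* WHERE city = 'New York' AND zip = 'zip for Detroit' */
    double estimated = 1.0 / ndistinct_zip;  /* reduced to the zip clause */
    double actual = 0.0;         /* no row has this combination */

    printf("estimated: %.4f actual: %.4f\n", estimated, actual);
    return 0;
}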

If you need stats without these "issues", you'll have to use an MCV
list or a histogram. Trying to fix the simple statistics types is
futile, IMHO.

regards
Tomas

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
