Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-30 12:55:50
Message-ID: 55BA1ED6.4090302@2ndquadrant.com

Hi,

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
> On 05/25/2015 11:43 PM, Tomas Vondra wrote:
>> There are 6 files attached, but only 0002-0006 are actually part of the
>> multivariate statistics patch itself.
>
> All of these patches are huge. In order to review this in a reasonable
> amount of time, we need to do this in several steps. So let's see what
> would be the minimal set of these patches that could be reviewed and
> committed, while still being useful.
>
> The main patches are:
>
> 1. shared infrastructure and functional dependencies
> 2. clause reduction using functional dependencies
> 3. multivariate MCV lists
> 4. multivariate histograms
> 5. multi-statistics estimation
>
> Would it make sense to commit only patches 1 and 2 first? Would that be
> enough to get a benefit from this?

I agree that the patch can't be reviewed as a single chunk - that's
exactly why I split the original (single chunk) patch into multiple
smaller pieces.

And yes, I believe committing pieces 1&2 might be enough to get
something useful, which can then be improved by adding the "usual" MCV
and histogram stats on top of that.

> I have some doubts about the clause reduction and functional
> dependencies part of this. It seems to treat functional dependency as
> a boolean property, but even with the classic zipcode and city case,
> it's not always an all or nothing thing. At least in some countries,
> there can be zipcodes that span multiple cities. So zipcode=X does
> not completely imply city=Y, although there is a strong correlation
> (if that's the right term). How strong does the correlation need to
> be for this patch to decide that zipcode implies city? I couldn't
> actually see a clear threshold stated anywhere.
>
> So rather than treating functional dependence as a boolean, I think
> it would make more sense to put a 0.0-1.0 number to it. That means
> that you can't do clause reduction like it's done in this patch,
> where you actually remove clauses from the query for cost estimation
> purposes. Instead, you need to calculate the selectivity for each
> clause independently, but instead of just multiplying the
> selectivities together, apply the "dependence factor" to it.
>
> Does that make sense? I haven't really looked at the MCV, histogram
> and "multi-statistics estimation" patches yet. Do those patches make
> the clause reduction patch obsolete? Should we forget about the
> clause reduction and functional dependency patch, and focus on those
> later patches instead?

Perhaps. It's true that most real-world data sets are not 100% valid
with respect to functional dependencies - either because of natural
imperfections (multiple cities with the same ZIP code) or just noise in
the data (incorrect entries ...). I believe this is even mentioned in the
code comments somewhere.

But there are two main reasons why I chose not to extend the functional
dependencies with the [0.0-1.0] value you propose.

Firstly, functional dependencies were meant to be the simplest possible
implementation, illustrating how the "infrastructure" is supposed to
work (which is the main topic of the first patch).

Secondly, all kinds of statistics are "simplifications" of the actual
data. So I think it's not incorrect to ignore the exceptions up to some
threshold.

I also don't think this will make the estimates better overall. Say 1% of
the rows contradict the functional dependency: you can either ignore them
and get good estimates for 99% of the values and incorrect estimates for
the remaining 1%, or tweak the rule a bit and make the estimates worse
for the 99% (and possibly better for the 1%).
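
To put some entirely made-up numbers on that: imagine 1M rows where
zip=X matches 10,000 rows, 99% of them with city=Y and the remaining 1%
with city=Z due to noise. Reducing (zip=X AND city=Y) to just zip=X
estimates 10,000 rows against an actual ~9,900, while the same reduction
applied to (zip=X AND city=Z) also estimates 10,000 rows against an
actual ~100. So the simplification is nearly exact for the conforming
queries and off by two orders of magnitude for the rare exceptions.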

That being said, I'm not against improving the functional dependencies.
I already have some improvements on my TODO - for example dependencies
on more columns (not just A=>B but [A,B]=>C and the like) - but I don't
think we should squash that into those two patches.

And yet another point - ISTM these cases might easily be handled better
by the statistics based on ndistinct coefficients, as proposed by
Kyotaro-san some time ago. That is, compute and track

ndistinct(A) * ndistinct(B) / ndistinct(A,B)

for all pairs of columns (or possibly larger groups). That seems to be
similar to the coefficient you propose.
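
Just to illustrate (this is not what the patch does, and the table and
column names are made up), the coefficient for a single pair of columns
can be eyeballed with an ad-hoc query:

  -- columns "a" and "b" of table "t" are just placeholders
  SELECT count(DISTINCT a)::numeric * count(DISTINCT b)
         / count(DISTINCT (a, b)) AS ndistinct_coeff
    FROM t;

The result is ~1.0 when the columns are independent and approaches
min(ndistinct(a), ndistinct(b)) when one column fully determines the
other.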

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
