Re: multivariate statistics / patch v7

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-30 08:21:58
Message-ID: 55B9DEA6.6030603@iki.fi
Lists: pgsql-hackers

On 05/25/2015 11:43 PM, Tomas Vondra wrote:
> There are 6 files attached, but only 0002-0006 are actually part of the
> multivariate statistics patch itself.

All of these patches are huge. In order to review this in a reasonable
amount of time, we need to do this in several steps. So let's see what
would be the minimal set of these patches that could be reviewed and
committed, while still being useful.

The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be
enough to get a benefit from this?

I have some doubts about the clause reduction and functional
dependencies part of this. It seems to treat functional dependency as a
boolean property, but even with the classic zipcode and city case, it's
not always an all-or-nothing thing. At least in some countries, there
can be zipcodes that span multiple cities. So zipcode=X does not
completely imply city=Y, although there is a strong correlation (if
that's the right term). How strong does the correlation need to be for
this patch to decide that zipcode implies city? I couldn't actually see
a clear threshold stated anywhere.
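
To make the question concrete, here is a minimal sketch of one way a degree of dependence could be measured on a sample. It is not taken from the patch. The function name `dependency_degree`, the (zipcode, city) tuple layout, and the definition itself (the fraction of sampled rows whose zipcode maps to exactly one city) are assumptions for illustration only.

```python
from collections import defaultdict


def dependency_degree(rows):
    """Return the fraction of rows whose zipcode maps to exactly one city.

    `rows` is an iterable of (zipcode, city) pairs from a sample.
    This definition is a hypothetical illustration, not the patch's method.
    """
    cities_by_zip = defaultdict(set)
    rows_by_zip = defaultdict(int)
    for zipcode, city in rows:
        cities_by_zip[zipcode].add(city)
        rows_by_zip[zipcode] += 1

    total = sum(rows_by_zip.values())
    if total == 0:
        return 0.0

    # Count only rows in zipcode groups that are consistent with zipcode -> city.
    consistent = sum(
        count
        for zipcode, count in rows_by_zip.items()
        if len(cities_by_zip[zipcode]) == 1
    )
    return consistent / total


# Zipcode "20001" spans two cities, so its two rows violate the dependency.
sample = [
    ("10001", "A"),
    ("10001", "A"),
    ("10002", "A"),
    ("20001", "B"),
    ("20001", "C"),
]
degree = dependency_degree(sample)
```

With a number like this, a boolean "zipcode implies city" decision amounts to picking a cutoff on `degree`, which is exactly the threshold in question.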

So rather than treating functional dependence as a boolean, I think it
would make more sense to put a 0.0-1.0 number to it. That means that you
can't do clause reduction like it's done in this patch, where you
actually remove clauses from the query for cost estimation purposes.
Instead, you need to calculate the selectivity for each clause
independently, but instead of just multiplying the selectivities
together, apply the "dependence factor" to it.
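
As a rough sketch of what applying such a factor could look like: the formula below is one possible choice, not something the patch implements. It interpolates between the independence assumption (plain multiplication) and full implication (the second clause adds no filtering). The name `combined_selectivity` and its parameters are hypothetical.

```python
def combined_selectivity(sel_a, sel_b, degree):
    """Estimate the selectivity of (a AND b) when a implies b to some degree.

    sel_a, sel_b: per-clause selectivities, each estimated independently.
    degree: assumed dependence factor in [0.0, 1.0] for a -> b.

    degree = 0.0 gives sel_a * sel_b (clauses treated as independent).
    degree = 1.0 gives sel_a (b is fully implied by a).
    """
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be between 0.0 and 1.0")
    return sel_a * (degree + (1.0 - degree) * sel_b)


# zipcode = X matches 1% of rows, city = Y matches 5% of rows.
independent = combined_selectivity(0.01, 0.05, 0.0)
implied = combined_selectivity(0.01, 0.05, 1.0)
partial = combined_selectivity(0.01, 0.05, 0.5)
```

Both clauses stay in the query; only the way their selectivities are combined changes.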

Does that make sense? I haven't really looked at the MCV, histogram and
"multi-statistics estimation" patches yet. Do those patches make the
clause reduction patch obsolete? Should we forget about the clause
reduction and functional dependency patch, and focus on those later
patches instead?

- Heikki
